Bizarre InternalTorchDynamoError with locally and formerly working code

Hi, has anyone seen this error in the last 3 months or so when running a compiled PyTorch model on a multi-GPU Lambda instance?

[rank0]:   File "/usr/lib/python3/dist-packages/torch/_dynamo/variables/builder.py", line 529, in _wrap
[rank0]:     if has_triton():
[rank0]:   File "/usr/lib/python3/dist-packages/torch/utils/_triton.py", line 37, in has_triton
[rank0]:     return is_device_compatible_with_triton() and has_triton_package()
[rank0]:   File "/usr/lib/python3/dist-packages/torch/utils/_triton.py", line 33, in is_device_compatible_with_triton
[rank0]:     if device_interface.is_available() and extra_check(device_interface):
[rank0]:   File "/usr/lib/python3/dist-packages/torch/utils/_triton.py", line 23, in cuda_extra_check
[rank0]:     return device_interface.Worker.get_device_properties().major >= 7
[rank0]:   File "/usr/lib/python3/dist-packages/torch/_dynamo/device_interface.py", line 191, in get_device_properties
[rank0]:     return caching_worker_device_properties["cuda"][device]
[rank0]: torch._dynamo.exc.InternalTorchDynamoError: IndexError: list index out of range

[rank0]: from user code:
[rank0]:   File "/usr/lib/python3/dist-packages/torch/_dynamo/external_utils.py", line 40, in inner
[rank0]:     return fn(*args, **kwargs)

[rank0]: Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

I got this last night trying to run my code on an 8x A100 (40 GB SXM4) instance. I opened a support ticket and a support engineer (thanks Quinn!) reproduced the issue on the same instance type in two different regions. However, we still don't know what caused it. The failure seems strictly internal (some kind of device property check), and the exact same code ran fine last October–November on the same instance type and last week on my local TensorBook.
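For what it's worth, the failing check boils down to a per-device compute-capability lookup (the Triton check wants major >= 7), so a quick sanity check on the instance is something along these lines. This is just a standalone diagnostic sketch I'd run, not the actual torch internals:

import torch

# Mimics the spirit of the Triton compatibility check in torch/utils/_triton.py:
# every visible CUDA device should report compute capability >= 7.0.
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"cuda:{i}: {props.name}, compute capability {props.major}.{props.minor}")
else:
    print("CUDA not available")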

I don’t have a minified repro yet, but here is what we ran:

pip3 install --upgrade requests
pip3 install wandb
pip3 install schedulefree

git clone https://github.com/EIFY/mup-vit.git
cd mup-vit
NUMEXPR_MAX_THREADS=116 torchrun main.py --fake-data --batch-size 4096 --log-steps 100

The closest match I could find online is this: [Bug]: Cannot Load any model. IndexError with CUDA, multiple GPUs · Issue #4069 · vllm-project/vllm · GitHub
The author reported that they “Solved it with a fresh install with a new docker container”, but that’s not an option for me…

Update: It turned out that I had simply forgotten the --multiprocessing-distributed flag, which is critical on multi-GPU instances. See Bizarre InternalTorchDynamoError with locally and formerly working code - #5 by ptrblck - PyTorch Forums for details.
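For completeness, the fix was just to add that flag to the run command above, i.e. something like (assuming nothing else about the invocation changes):

NUMEXPR_MAX_THREADS=116 torchrun main.py --fake-data --batch-size 4096 --log-steps 100 --multiprocessing-distributed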