Bizarre InternalTorchDynamoError with locally and formerly working code

Hi, has anyone seen this error in the last 3 months or so when running a compiled PyTorch model on a multi-GPU Lambda instance?

[rank0]:   File "/usr/lib/python3/dist-packages/torch/_dynamo/variables/builder.py", line 529, in _wrap
[rank0]:     if has_triton():
[rank0]:   File "/usr/lib/python3/dist-packages/torch/utils/_triton.py", line 37, in has_triton
[rank0]:     return is_device_compatible_with_triton() and has_triton_package()
[rank0]:   File "/usr/lib/python3/dist-packages/torch/utils/_triton.py", line 33, in is_device_compatible_with_triton
[rank0]:     if device_interface.is_available() and extra_check(device_interface):
[rank0]:   File "/usr/lib/python3/dist-packages/torch/utils/_triton.py", line 23, in cuda_extra_check
[rank0]:     return device_interface.Worker.get_device_properties().major >= 7
[rank0]:   File "/usr/lib/python3/dist-packages/torch/_dynamo/device_interface.py", line 191, in get_device_properties
[rank0]:     return caching_worker_device_properties["cuda"][device]
[rank0]: torch._dynamo.exc.InternalTorchDynamoError: IndexError: list index out of range

[rank0]: from user code:
[rank0]:   File "/usr/lib/python3/dist-packages/torch/_dynamo/external_utils.py", line 40, in inner
[rank0]:     return fn(*args, **kwargs)

[rank0]: Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

I got this last night trying to run my code on an 8x A100 (40 GB SXM4) instance. I opened a support ticket and a support engineer (thanks Quinn!) reproduced the issue on the same instance type in two different regions. However, we still don't know what caused it. The failure seems strictly internal (some kind of device property check), and the exact same code ran fine last October–November on the same instance type and last week on my local TensorBook.
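For what it's worth, the failing check boils down to a per-device compute-capability lookup (the Triton check wants major >= 7), so a quick sanity check on the instance is something along these lines. This is just a standalone diagnostic sketch I'd run, not the actual torch internals:

import torch

# Mimics the spirit of the Triton compatibility check in torch/utils/_triton.py:
# every visible CUDA device should report compute capability >= 7.0.
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"cuda:{i}: {props.name}, compute capability {props.major}.{props.minor}")
else:
    print("CUDA not available")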

I don’t have a minified repro yet, but here is what we ran:

pip3 install --upgrade requests
pip3 install wandb
pip3 install schedulefree

git clone https://github.com/EIFY/mup-vit.git
cd mup-vit
NUMEXPR_MAX_THREADS=116 torchrun main.py --fake-data --batch-size 4096 --log-steps 100

The closest match I could find online is this: [Bug]: Cannot Load any model. IndexError with CUDA, multiple GPUs · Issue #4069 · vllm-project/vllm · GitHub
The author reported that they “Solved it with a fresh install with a new docker container”, but that’s not an option for me…

Update: It turned out that I had simply forgotten the --multiprocessing-distributed flag, which is critical on multi-GPU instances. See Bizarre InternalTorchDynamoError with locally and formerly working code - #5 by ptrblck - PyTorch Forums for details.
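For completeness, the fix was just to add that flag to the run command above, i.e. something like (assuming nothing else about the invocation changes):

NUMEXPR_MAX_THREADS=116 torchrun main.py --fake-data --batch-size 4096 --log-steps 100 --multiprocessing-distributed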