Hi, has anyone seen this error in the last 3 months or so when running a compiled PyTorch model on a multi-GPU Lambda instance?
[rank0]: File "/usr/lib/python3/dist-packages/torch/_dynamo/variables/builder.py", line 529, in _wrap
[rank0]: if has_triton():
[rank0]: File "/usr/lib/python3/dist-packages/torch/utils/_triton.py", line 37, in has_triton
[rank0]: return is_device_compatible_with_triton() and has_triton_package()
[rank0]: File "/usr/lib/python3/dist-packages/torch/utils/_triton.py", line 33, in is_device_compatible_with_triton
[rank0]: if device_interface.is_available() and extra_check(device_interface):
[rank0]: File "/usr/lib/python3/dist-packages/torch/utils/_triton.py", line 23, in cuda_extra_check
[rank0]: return device_interface.Worker.get_device_properties().major >= 7
[rank0]: File "/usr/lib/python3/dist-packages/torch/_dynamo/device_interface.py", line 191, in get_device_properties
[rank0]: return caching_worker_device_properties["cuda"][device]
[rank0]: torch._dynamo.exc.InternalTorchDynamoError: IndexError: list index out of range
[rank0]: from user code:
[rank0]: File "/usr/lib/python3/dist-packages/torch/_dynamo/external_utils.py", line 40, in inner
[rank0]: return fn(*args, **kwargs)
[rank0]: Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
I got this last night trying to run my code on an 8x A100 (40 GB SXM4) instance. I opened a support ticket, and a support engineer (thanks Quinn!) reproduced the issue on the same type of instance in two different regions. However, we still don’t know what caused it. It seems strictly internal (some kind of device property check), and my exact same code ran fine last October-November on the same type of instance and last week on my local TensorBook.
I don’t have a minimal repro yet, but here is what we ran:
pip3 install --upgrade requests
pip3 install wandb
pip3 install schedulefree
git clone https://github.com/EIFY/mup-vit.git
cd mup-vit
NUMEXPR_MAX_THREADS=116 torchrun main.py --fake-data --batch-size 4096 --log-steps 100
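In case it helps anyone narrow this down, here is a rough sketch of a check that could be run on the instance before the commands above. It prints how many GPUs the process can see and the compute capability of each, so the output can be compared against the device index that the failing lookup uses. This assumes the IndexError means the cached device property list is shorter than the index being looked up, which the traceback suggests but which I have not confirmed. The helper name collect_gpu_info is made up for this sketch and is not part of PyTorch.

```python
import os


def collect_gpu_info():
    """Gather basic facts about the GPUs visible to this process.

    Hypothetical helper for debugging; returns a plain dict so the
    output can be pasted into a support ticket or forum reply.
    """
    info = {
        # Environment variables that affect which devices a rank sees.
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "LOCAL_RANK": os.environ.get("LOCAL_RANK"),  # set by torchrun
        "torch_version": None,
        "device_count": 0,
        "devices": [],
    }
    try:
        import torch
    except ImportError:
        # No PyTorch in this environment; report only the env variables.
        return info

    info["torch_version"] = torch.__version__
    if torch.cuda.is_available():
        info["device_count"] = torch.cuda.device_count()
        for i in range(info["device_count"]):
            props = torch.cuda.get_device_properties(i)
            # The failing check in the traceback compares props.major to 7.
            info["devices"].append((i, props.name, props.major, props.minor))
    return info


if __name__ == "__main__":
    for key, value in collect_gpu_info().items():
        print(f"{key}: {value}")
```

Saving this as a script and launching it with torchrun in place of main.py should show whether every rank sees all 8 GPUs. If the counts look right, the problem is more likely in how the compile path caches device properties than in the instance setup.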