I’m trying to use a PyTorch-based container.
The NVIDIA runtime wasn’t available out of the box, so I downloaded and manually installed the NVIDIA container runtime on my H100 instance.
I then rebooted and started the container with --runtime=nvidia.
However, I get a warning that the PyTorch installation isn’t compatible with sm_90 hardware.
That seems weird to me: shouldn’t it be backwards compatible? What am I missing?
+ docker run --runtime=nvidia --rm -it --name train -v /mpt:/mpt -v /mpt/cache:/root/.cache train composer train.py observe-help.yaml
==========
== CUDA ==
==========
CUDA Version 11.7.1
Container image Copyright (c) 2016-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
/usr/lib/python3/dist-packages/torch/cuda/__init__.py:155: UserWarning:
NVIDIA H100 PCIe with CUDA capability sm_90 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75 sm_80 sm_86.
If you want to use the NVIDIA H100 PCIe GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
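For reference, here is a minimal check I was planning to run (assuming I can open a Python shell inside the running container) to print which CUDA architectures the bundled PyTorch build was compiled for, alongside what the GPU itself reports:

```python
# Quick diagnostic: compare the GPU's compute capability with the
# architectures this PyTorch build was compiled for.
import torch

print("torch version:      ", torch.__version__)
print("built against CUDA: ", torch.version.cuda)
print("compiled arch list: ", torch.cuda.get_arch_list())  # the warning above suggests this stops at sm_86
if torch.cuda.is_available():
    # An H100 reports compute capability (9, 0), i.e. sm_90
    print("device capability:  ", torch.cuda.get_device_capability(0))
```

Running that should at least tell me whether the problem is the image’s PyTorch build rather than my runtime install.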