Training Microsoft's MatterGen model doesn't use the GPUs on a Lambda Cloud instance

I’m trying to train Microsoft’s MatterGen model using eight A100 GPUs on Lambda Cloud. The instance type is gpu_8x_a100_80gb_sxm4 and the region is us-east-1. The Python version on the instance is 3.10.12. Here are the details of the instance as reported by the fastfetch tool:

                             ....              ubuntu@xxx-xxx-xx-xxx
              .',:clooo:  .:looooo:.           ---------------------
           .;looooooooc  .oooooooooo'          OS: Ubuntu 22.04.5 LTS x86_64
        .;looooool:,''.  :ooooooooooc          Host: KVM/QEMU Standard PC (Q35 + ICH9, 2009) (pc-q35-8.0)
       ;looool;.         'oooooooooo,          Kernel: Linux 6.8.0-52-generic
      ;clool'             .cooooooc.  ,,       Uptime: 1 hour, 53 mins
         ...                ......  .:oo,      Packages: 1243 (dpkg), 6 (snap)
  .;clol:,.                        .loooo'     Shell: bash 5.1.16
 :ooooooooo,                        'ooool     Terminal: /dev/pts/0
'ooooooooooo.                        loooo.    CPU: 2 x AMD EPYC 7J13 64-Core (240) @ 2.45 GHz
'ooooooooool                         coooo.    GPU 1: NVIDIA A100 SXM4 80GB
 ,loooooooc.                        .loooo.    GPU 2: NVIDIA A100 SXM4 80GB
   .,;;;'.                          ;ooooc     GPU 3: RedHat Virtio GPU
       ...                         ,ooool.     GPU 4: NVIDIA A100 SXM4 80GB
    .cooooc.              ..',,'.  .cooo.      GPU 5: NVIDIA A100 SXM4 80GB
      ;ooooo:.           ;oooooooc.  :l.       GPU 6: NVIDIA A100 SXM4 80GB
       .coooooc,..      coooooooooo.           GPU 7: NVIDIA A100 SXM4 80GB
         .:ooooooolc:. .ooooooooooo'           GPU 8: NVIDIA A100 SXM4 80GB
           .':loooooo;  ,oooooooooc            GPU 9: NVIDIA A100 SXM4 80GB
               ..';::c'  .;loooo:'             Memory: 21.51 GiB / 1.73 TiB (1%)
                                               Swap: Disabled
                                               Disk (/): 32.86 GiB / 18.93 TiB (0%) - ext4
                                               Local IP (eno1): 10.19.89.112/20
                                               Locale: C.UTF-8

I had to install Git LFS to train the MatterGen model. I used the commands shown below to install Git LFS on the instance.

sudo apt install git-lfs

git lfs install

I also installed uv, which the MatterGen docs suggest for setting up the Python environment.

curl -LsSf https://astral.sh/uv/install.sh | sh

Next, I cloned the mattergen repository to the instance and followed the suggested installation steps:

git clone https://github.com/microsoft/mattergen.git

cd mattergen

pip install uv

uv venv .venv --python 3.10 

source .venv/bin/activate

uv pip install -e .

I preprocessed the MP-20 dataset for training using the commands shown below.

git lfs pull -I data-release/mp-20/ --exclude=""

unzip data-release/mp-20/mp_20.zip -d datasets

csv-to-dataset --csv-folder datasets/mp_20/ --dataset-name mp_20 --cache-folder datasets/cache

After processing the dataset, I tried to train the model using the command shown below.

mattergen-train \
    data_module=mp_20 \
    ~trainer.logger \
    trainer.devices=8

On the Lambda Cloud instance, none of the GPUs are utilized during training, which I checked with nvidia-smi. The training runs the CPUs at 100%, which I verified with htop. I tried the same procedure on a different system and MatterGen was able to find and use the GPUs there.
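
In case it is relevant, this is the kind of check I can run from the activated .venv to confirm which PyTorch build is actually being imported (the wheel pulled in by uv pip install -e . rather than the system-wide Lambda Stack one) and whether that build has CUDA support. These are just standard Python/torch calls, nothing MatterGen-specific:

python -c "import torch; print(torch.__file__)"        # which installation gets imported

python -c "import torch; print(torch.version.cuda)"    # CUDA version the wheel was built against; None means a CPU-only build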

So is there something else I need to do to install and train MatterGen on a Lambda Cloud instance? I know Lambda Stack is preinstalled on these instances, but I don’t know how to install MatterGen within that stack.
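
For reference, the trainer.* overrides look like they map onto the PyTorch Lightning Trainer arguments (trainer.devices=8 already does), so forcing the accelerator explicitly might be worth a try. This is an untested sketch: I haven't verified that MatterGen's Hydra config defines trainer.accelerator, and if it doesn't, the override would need a leading +.

mattergen-train \
    data_module=mp_20 \
    ~trainer.logger \
    trainer.accelerator=gpu \
    trainer.devices=8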

Sounds like the application or process you’re trying to run isn’t correctly seeing the installed GPU drivers. It took a lot of fiddling for me to get the Rust program I was using to correctly use OpenGL with the NVIDIA hardware and drivers.
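
A quick way to check whether the PyTorch inside your .venv can actually see the cards is plain nvidia-smi plus the standard torch.cuda calls (nothing MatterGen-specific):

nvidia-smi

python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"

If is_available() comes back False or the device count is 0, Lightning will typically fall back to running on the CPU, which would match what you’re seeing in htop.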

It’s working now. Apparently the code takes a while to initialize the GPUs and start the training.
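
Leaving something like watch -n 1 nvidia-smi running in a second terminal makes it easy to see the memory usage and utilization climb once that initialization finishes:

watch -n 1 nvidia-smi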