I’m trying to train Microsoft’s MatterGen model using eight A100 GPUs on Lambda Cloud. The instance type is gpu_8x_a100_80gb_sxm4 and the region is us-east-1. The Python version on the instance is 3.10.12. Here are the details of the instance from the fastfetch tool:
ubuntu@xxx-xxx-xx-xxx
---------------------
OS: Ubuntu 22.04.5 LTS x86_64
Host: KVM/QEMU Standard PC (Q35 + ICH9, 2009) (pc-q35-8.0)
Kernel: Linux 6.8.0-52-generic
Uptime: 1 hour, 53 mins
Packages: 1243 (dpkg), 6 (snap)
Shell: bash 5.1.16
Terminal: /dev/pts/0
CPU: 2 x AMD EPYC 7J13 64-Core (240) @ 2.45 GHz
GPU 1: NVIDIA A100 SXM4 80GB
GPU 2: NVIDIA A100 SXM4 80GB
GPU 3: RedHat Virtio GPU
GPU 4: NVIDIA A100 SXM4 80GB
GPU 5: NVIDIA A100 SXM4 80GB
GPU 6: NVIDIA A100 SXM4 80GB
GPU 7: NVIDIA A100 SXM4 80GB
GPU 8: NVIDIA A100 SXM4 80GB
GPU 9: NVIDIA A100 SXM4 80GB
Memory: 21.51 GiB / 1.73 TiB (1%)
Swap: Disabled
Disk (/): 32.86 GiB / 18.93 TiB (0%) - ext4
Local IP (eno1): 10.19.89.112/20
Locale: C.UTF-8
I had to install Git LFS to train the MatterGen model. I used the commands shown below to install Git LFS on the instance.
sudo apt install git-lfs
git lfs install
I also installed uv, which the MatterGen docs suggest for setting up the Python environment.
curl -LsSf https://astral.sh/uv/install.sh | sh
Next, I cloned the mattergen repository to the instance and followed the suggested steps for installation:
git clone https://github.com/microsoft/mattergen.git
cd mattergen
pip install uv
uv venv .venv --python 3.10
source .venv/bin/activate
uv pip install -e .
I preprocessed the MP-20 dataset for training using the commands shown below.
git lfs pull -I data-release/mp-20/ --exclude=""
unzip data-release/mp-20/mp_20.zip -d datasets
csv-to-dataset --csv-folder datasets/mp_20/ --dataset-name mp_20 --cache-folder datasets/cache
After processing the dataset, I tried to train the model using the command shown below.
mattergen-train \
data_module=mp_20 \
~trainer.logger \
trainer.devices=8
On the Lambda Cloud instance, none of the GPUs are utilized during training, which I checked with nvidia-smi. The training uses 100% of the CPUs, which I verified with htop. I tried this same procedure on a different system and mattergen was able to find and utilize the GPUs.
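One way to narrow this down would be to check whether the PyTorch build inside the activated .venv can see the GPUs at all. The script below is only a minimal sketch of such a check, not something from the MatterGen docs: the function name cuda_report is made up for illustration, and it assumes that mattergen pulls in PyTorch as a dependency. It uses only standard PyTorch calls (torch.version.cuda, torch.cuda.is_available, torch.cuda.device_count) and skips them if torch is not importable.

```python
import importlib.util
import os
import shutil


def cuda_report() -> dict:
    """Collect basic facts about GPU visibility in the current Python environment.

    Hypothetical helper for illustration; it is not part of mattergen.
    """
    report = {
        # If this is set to an empty string, CUDA applications see no GPUs.
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        # Whether the NVIDIA driver utility can be found from this shell.
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        # Whether PyTorch is installed in the interpreter running this script.
        "torch_installed": importlib.util.find_spec("torch") is not None,
    }
    if report["torch_installed"]:
        import torch

        report["torch_version"] = torch.__version__
        # None here means a CPU-only PyTorch build was installed.
        report["torch_cuda_build"] = torch.version.cuda
        report["cuda_available"] = torch.cuda.is_available()
        report["device_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If torch_cuda_build comes back as None, or cuda_available is False while nvidia-smi lists the A100s, that would point at the PyTorch installation inside the virtual environment rather than at the instance or the driver.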
So is there something else I need to do to install and train MatterGen on a Lambda Cloud instance? I know that Lambda Stack is installed on the instances, but I don’t know how to install mattergen within this stack.