I have tried the new H100 and have to admit my expectations were higher, though that may be because my old code has bottlenecks other than the GPU. But this is curious:
Epoch 0 | Training | Elapsed Time: 3:26:44 | Steps: 4013 | Loss: 58.975423
Epoch 0 | Validation | Elapsed Time: 0:03:45 | Steps: 67 | Loss: 43.107231 | Dataset: /models/csv/dev.csv
--------------------------------------------------------------------------------
Epoch 1 | Training | Elapsed Time: 2:58:03 | Steps: 4013 | Loss: 47.359938
Epoch 1 | Validation | Elapsed Time: 0:03:27 | Steps: 67 | Loss: 36.239220 | Dataset: /models/csv/dev.csv
--------------------------------------------------------------------------------
Epoch 2 | Training | Elapsed Time: 2:18:15 | Steps: 4013 | Loss: 41.412868
Epoch 2 | Validation | Elapsed Time: 0:01:03 | Steps: 67 | Loss: 32.271228 | Dataset: /models/csv/dev.csv
--------------------------------------------------------------------------------
Epoch 3 | Training | Elapsed Time: 2:03:08 | Steps: 4013 | Loss: 37.775420
Epoch 3 | Validation | Elapsed Time: 0:01:04 | Steps: 67 | Loss: 29.865935 | Dataset: /models/csv/dev.csv
--------------------------------------------------------------------------------
Epoch 4 | Training | Elapsed Time: 2:03:36 | Steps: 4013 | Loss: 35.070321
Epoch 4 | Validation | Elapsed Time: 0:01:09 | Steps: 67 | Loss: 27.876995 | Dataset: /models/csv/dev.csv
--------------------------------------------------------------------------------
Epoch 5 | Training | Elapsed Time: 2:03:30 | Steps: 4013 | Loss: 33.242597
Epoch 5 | Validation | Elapsed Time: 0:01:05 | Steps: 67 | Loss: 26.488565 | Dataset: /models/csv/dev.csv
--------------------------------------------------------------------------------
Epoch 6 | Training | Elapsed Time: 2:03:34 | Steps: 4013 | Loss: 31.552849
Epoch 6 | Validation | Elapsed Time: 0:01:06 | Steps: 67 | Loss: 25.423725 | Dataset: /models/csv/dev.csv
--------------------------------------------------------------------------------
Epoch 7 | Training | Elapsed Time: 2:24:47 | Steps: 4013 | Loss: 30.260191
Epoch 7 | Validation | Elapsed Time: 0:03:40 | Steps: 67 | Loss: 24.497889 | Dataset: /models/csv/dev.csv
--------------------------------------------------------------------------------
Epoch 8 | Training | Elapsed Time: 2:53:19 | Steps: 4013 | Loss: 29.293678
Epoch 8 | Validation | Elapsed Time: 0:02:53 | Steps: 67 | Loss: 23.781718 | Dataset: /models/csv/dev.csv
--------------------------------------------------------------------------------
Epoch 9 | Training | Elapsed Time: 2:47:53 | Steps: 4013 | Loss: 28.336873
Epoch 9 | Validation | Elapsed Time: 0:02:43 | Steps: 67 | Loss: 23.187069 | Dataset: /models/csv/dev.csv
--------------------------------------------------------------------------------
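To make the variation concrete, per-epoch throughput can be computed from the log lines above; a small sketch (the parsing pattern assumes exactly the log format shown, and only a few representative lines are included here):

```python
import re
from datetime import timedelta

# A few representative training lines copied from the log above.
LOG = """\
Epoch 0 | Training | Elapsed Time: 3:26:44 | Steps: 4013 | Loss: 58.975423
Epoch 3 | Training | Elapsed Time: 2:03:08 | Steps: 4013 | Loss: 37.775420
Epoch 8 | Training | Elapsed Time: 2:53:19 | Steps: 4013 | Loss: 29.293678
"""

def throughput(line):
    # Parse "Elapsed Time: H:MM:SS" and "Steps: N", return steps/second.
    h, m, s = map(int, re.search(r"Elapsed Time: (\d+):(\d+):(\d+)", line).groups())
    steps = int(re.search(r"Steps: (\d+)", line).group(1))
    seconds = timedelta(hours=h, minutes=m, seconds=s).total_seconds()
    return steps / seconds

for line in LOG.splitlines():
    print(f"{line.split(' | ')[0]}: {throughput(line):.2f} steps/s")
```

By this measure the fast epochs sustain about 0.54 steps/s while epoch 0 manages only about 0.32 steps/s, roughly a 40% drop at the same step count.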
I wonder what could have caused the increased epoch times. The slow periods are not precisely 24 h apart, although they roughly correspond to 10 AM to 3 PM in Utah. Nothing else was running on the server at the time apart from an occasional `ssh` session, `top`, and, at the start, `watch -n 10 nvidia-smi`.
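Since nothing else was visibly running, one way to catch the cause on a future run would be to log GPU telemetry for the whole training, e.g. with `nvidia-smi --query-gpu=timestamp,utilization.gpu,clocks.sm,temperature.gpu --format=csv,noheader -l 60 >> gpu.log`, and afterwards scan the log for SM-clock drops, which would point at thermal or power throttling during those daytime hours. A sketch of such a scan, assuming that CSV layout (the sample rows below are made up for illustration, not real measurements):

```python
import csv, io

# Sample telemetry in the nvidia-smi CSV layout assumed above:
# timestamp, utilization.gpu [%], clocks.sm [MHz], temperature.gpu
SAMPLE = """\
2023/05/01 10:00:00.000, 98 %, 1980 MHz, 62
2023/05/01 12:00:00.000, 97 %, 1410 MHz, 84
2023/05/01 16:00:00.000, 99 %, 1975 MHz, 65
"""

def clock_drops(text, baseline_frac=0.9):
    # Return timestamps whose SM clock fell well below the run's peak clock.
    rows = [[f.strip() for f in r] for r in csv.reader(io.StringIO(text))]
    clocks = [int(r[2].split()[0]) for r in rows]
    peak = max(clocks)
    return [r[0] for r, c in zip(rows, clocks) if c < baseline_frac * peak]

print(clock_drops(SAMPLE))  # → ['2023/05/01 12:00:00.000']
```

If the flagged timestamps line up with the slow epochs (and with the 10 AM to 3 PM window), throttling is the likely culprit; if the clocks stay flat, the bottleneck is more likely elsewhere, e.g. a shared filesystem or data-loading path that is busier during the day.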