Model runs on A10, but not H100

bruceab · June 8, 2023, 9:36pm

I have a simple model that runs fine on a 1x A10 (24 GB PCIe) instance, but fails to run on a 1x H100 (80 GB PCIe) instance. The error I get on the H100 is the following:

Attempting to perform BLAS operation using StreamExecutor without BLAS support
[[{{node sequential/dense/MatMul}}]] [Op:__inference_train_function_1166]

All help is appreciated. My code is below.

import tensorflow as tf
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1) / 255.0
x_test = x_test.reshape(-1, 28, 28, 1) / 255.0
y_train = tf.keras.utils.to_categorical(y_train, num_classes=10)
y_test = tf.keras.utils.to_categorical(y_test, num_classes=10)

model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(MaxPooling2D((2, 2)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D((2, 2)))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dense(10, activation='softmax'))

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

model.fit(x_train, y_train, epochs=10, batch_size=128, validation_data=(x_test, y_test))

markd · June 13, 2023, 11:48pm

I experimented with this a bit, trying the usual TensorFlow GPU memory issues and workarounds.

I ended up running the script like this:
$ TF_FORCE_GPU_ALLOW_GROWTH='true' python test.py

I also added the following to the code:

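from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

# Allocate GPU memory on demand instead of reserving nearly all of it at startup.
config = ConfigProto()
config.gpu_options.allow_growth = True
# InteractiveSession installs itself as the default session with this config.
session = InteractiveSession(config=config)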
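
For anyone who would rather avoid the compat.v1 session, here is a rough sketch of the same memory-growth setting using the TensorFlow 2.x tf.config API. This is not what was tested above, and the thread does not confirm that it resolves the H100 error. The helper name enable_memory_growth is made up for illustration, and it takes the tf.config namespace as an argument only so the logic can be tried without a GPU present.

```python
def enable_memory_growth(tf_config):
    """Turn on memory growth for every visible GPU.

    `tf_config` is assumed to be the `tf.config` namespace from TensorFlow 2.x.
    Returns the number of GPUs that were configured.
    """
    gpus = tf_config.list_physical_devices("GPU")
    for gpu in gpus:
        # This has to run before the GPU is initialized (before any op runs);
        # TensorFlow raises a RuntimeError if it is called too late.
        tf_config.experimental.set_memory_growth(gpu, True)
    return len(gpus)


# Usage, at the top of test.py before the model is built:
#   import tensorflow as tf
#   enable_memory_growth(tf.config)
```

With this in place the TF_FORCE_GPU_ALLOW_GROWTH environment variable should not be needed as well, since both request the same on-demand allocation behaviour.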
