Cannot Run Falcon-40B on H100

I am new to LambdaLabs and recently launched an H100 instance. I tried to run a script that tests the Falcon-40B Instruct model, but I get an error when I run python test.py. Any help would be appreciated.

Terminal:

python test.py
2023-06-26 21:30:50.884700: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX512F AVX512_VNNI AVX512_BF16 AVX_VNNI
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-06-26 21:30:51.090124: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:

  Local host:            209-20-157-85
  Device name:           mlx5_0
  Device vendor ID:      0x02c9
  Device vendor part ID: 4122

Default device parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           209-20-157-85
  Local device:         mlx5_0
  Local port:           1
  CPCs attempted:       udcm
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_domain).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: 209-20-157-85
  Location: mtl_ofi_component.c:610
  Error: No data available (61)
--------------------------------------------------------------------------
/home/ubuntu/.local/lib/python3.8/site-packages/pandas/core/computation/expressions.py:20: UserWarning: Pandas requires version '2.7.3' or newer of 'numexpr' (version '2.7.1' currently installed).
  from pandas.core.computation.check import NUMEXPR_INSTALLED

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /home/ubuntu/.local/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so
/home/ubuntu/.local/lib/python3.8/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
/home/ubuntu/.local/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
/home/ubuntu/.local/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/cuda/lib64')}
  warn(msg)
ERROR: python: undefined symbol: cudaRuntimeGetVersion
CUDA SETUP: libcudart.so path is None
CUDA SETUP: Is seems that your cuda installation is not in your path. See https://github.com/TimDettmers/bitsandbytes/issues/85 for more information.
CUDA SETUP: CUDA version lower than 11 are currently not supported for LLM.int8(). You will be only to use 8-bit optimizers and quantization routines!!
/home/ubuntu/.local/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: No libcudart.so found! Install CUDA or the cudatoolkit package (anaconda)!
  warn(msg)
CUDA SETUP: Highest compute capability among GPUs detected: 9.0
CUDA SETUP: Detected CUDA version 00
CUDA SETUP: Loading binary /home/ubuntu/.local/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so...
Loading checkpoint shards:   0%|                                                                                                                                                      | 0/9 [00:04<?, ?it/s]
Traceback (most recent call last):
  File "test.py", line 10, in <module>
    model = AutoModelForCausalLM.from_pretrained(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py", line 479, in from_pretrained
    return model_class.from_pretrained(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2881, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/modeling_utils.py", line 3228, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/modeling_utils.py", line 728, in _load_state_dict_into_meta_model
    set_module_quantized_tensor_to_device(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/utils/bitsandbytes.py", line 89, in set_module_quantized_tensor_to_device
    new_value = bnb.nn.Int8Params(new_value, requires_grad=False, **kwargs).to(device)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/bitsandbytes/nn/modules.py", line 294, in to
    return self.cuda(device)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/bitsandbytes/nn/modules.py", line 258, in cuda
    CB, CBt, SCB, SCBt, coo_tensorB = bnb.functional.double_quant(B)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/bitsandbytes/functional.py", line 1987, in double_quant
    row_stats, col_stats, nnz_row_ptr = get_colrow_absmax(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/bitsandbytes/functional.py", line 1876, in get_colrow_absmax
    lib.cget_col_row_stats(ptrA, ptrRowStats, ptrColStats, ptrNnzrows, ct.c_float(threshold), rows, cols)
  File "/usr/lib/python3.8/ctypes/__init__.py", line 386, in __getattr__
    func = self.__getitem__(name)
  File "/usr/lib/python3.8/ctypes/__init__.py", line 391, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /home/ubuntu/.local/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cget_col_row_stats

The test.py script is:

# Runs Falcon-40B Instruct in 8-bit mode, which should take ~45 GB of GPU memory

from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

model_id = "tiiuae/falcon-40b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    load_in_8bit=True,
    device_map="auto",
)

print(f'Loaded {model_id}')

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

prompt = "Write a poem about Valencia."

print(f'Prompt: {prompt}\n')

sequences = pipeline(
    prompt,
    max_length=500,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

The issue is with bitsandbytes: it could not locate the CUDA runtime (libcudart.so), so it fell back to its CPU-only binary, which is missing the GPU symbols the traceback complains about (cget_col_row_stats). One solution is to run the following script, which finds libcudart.so and appends its directory to LD_LIBRARY_PATH in .bashrc:

#!/bin/bash

# Find the CUDA runtime library (take the first match if several exist)
FILE_LOCATION=$(find / -name libcudart.so 2>/dev/null | head -n 1)

# If the file was found, add its directory to LD_LIBRARY_PATH via .bashrc
if [ -n "$FILE_LOCATION" ]; then
  LIB_PATH=${FILE_LOCATION%/*}
  echo "Found path: $LIB_PATH"

  # Check whether the path is already in .bashrc
  if ! grep -q "LD_LIBRARY_PATH=.*$LIB_PATH" ~/.bashrc; then
    echo "Updating .bashrc with the found path..."
    echo "export LD_LIBRARY_PATH=\$LD_LIBRARY_PATH:$LIB_PATH" >> ~/.bashrc
    echo ".bashrc updated. Please restart your terminal or run 'source ~/.bashrc'"
  else
    echo "The path is already in .bashrc"
  fi
else
  echo "File libcudart.so not found."
fi
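
After sourcing ~/.bashrc, you can sanity-check the fix with the diagnostic that the bug-report banner in the log above already points to:

source ~/.bashrc
python -m bitsandbytes

If the runtime is found, the diagnostic should report a real CUDA version instead of the "Detected CUDA version 00" line above, and the library it loads should no longer be libbitsandbytes_cpu.so.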

I forgot to add that Falcon does not currently run on a LambdaLabs H100 with this setup, but it worked for me on an A6000.

Hi @Gadersd!

Can you send the error you are getting on the H100 instance?

Best,
Yanos

It was a cuBLAS error. See "cuBLAS API failed with status 15" (tloen/alpaca-lora issue #174 on GitHub): https://github.com/tloen/alpaca-lora/issues/174
It only occurs for me on H100.
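
One workaround, if you just need the model running while that gets sorted out, is to skip 8-bit quantization so the bitsandbytes int8 matmul (the code path the cuBLAS error comes from) is never invoked. This is only a sketch I have not benchmarked: in bfloat16 the weights take roughly double the ~45 GB quoted in the script's comment, so on a single 80 GB H100 device_map="auto" may offload part of the model to the CPU.

from transformers import AutoModelForCausalLM
import torch

model_id = "tiiuae/falcon-40b-instruct"

# Same call as in test.py, minus load_in_8bit=True, so no bitsandbytes
# kernels are involved; weights are loaded in plain bfloat16 instead.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)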

@Gadersd, please start a ticket at https://support.lambdalabs.com and we can look into it further.