Cannot Run Falcon-40B on H100

I am new to LambdaLabs and recently launched an H100 instance. I tried to run a script that tests the Falcon-40B Instruct model, but I get an error when I run it with python. Any help would be appreciated.


2023-06-26 21:30:50.884700: I tensorflow/core/platform/] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX512F AVX512_VNNI AVX512_BF16 AVX_VNNI
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-06-26 21:30:51.090124: I tensorflow/core/util/] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
WARNING: No preset parameters were found for the device that Open MPI

  Local host:            209-20-157-85
  Device name:           mlx5_0
  Device vendor ID:      0x02c9
  Device vendor part ID: 4122

Default device parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_no_device_params_found to 0.
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           209-20-157-85
  Local device:         mlx5_0
  Local port:           1
  CPCs attempted:       udcm
Open MPI failed an OFI Libfabric library call (fi_domain).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: 209-20-157-85
  Location: mtl_ofi_component.c:610
  Error: No data available (61)
/home/ubuntu/.local/lib/python3.8/site-packages/pandas/core/computation/ UserWarning: Pandas requires version '2.7.3' or newer of 'numexpr' (version '2.7.1' currently installed).
  from pandas.core.computation.check import NUMEXPR_INSTALLED

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to:
bin /home/ubuntu/.local/lib/python3.8/site-packages/bitsandbytes/
/home/ubuntu/.local/lib/python3.8/site-packages/bitsandbytes/ UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
/home/ubuntu/.local/lib/python3.8/site-packages/bitsandbytes/ undefined symbol: cadam32bit_grad_fp32
CUDA_SETUP: WARNING! not found in any environmental path. Searching in backup paths...
/home/ubuntu/.local/lib/python3.8/site-packages/bitsandbytes/cuda_setup/ UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/cuda/lib64')}
ERROR: python: undefined symbol: cudaRuntimeGetVersion
CUDA SETUP: path is None
CUDA SETUP: Is seems that your cuda installation is not in your path. See for more information.
CUDA SETUP: CUDA version lower than 11 are currently not supported for LLM.int8(). You will be only to use 8-bit optimizers and quantization routines!!
/home/ubuntu/.local/lib/python3.8/site-packages/bitsandbytes/cuda_setup/ UserWarning: WARNING: No found! Install CUDA or the cudatoolkit package (anaconda)!
CUDA SETUP: Highest compute capability among GPUs detected: 9.0
CUDA SETUP: Detected CUDA version 00
CUDA SETUP: Loading binary /home/ubuntu/.local/lib/python3.8/site-packages/bitsandbytes/
Loading checkpoint shards:   0%|                                                                                                                                                      | 0/9 [00:04<?, ?it/s]
Traceback (most recent call last):
  File "", line 10, in <module>
    model = AutoModelForCausalLM.from_pretrained(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/models/auto/", line 479, in from_pretrained
    return model_class.from_pretrained(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/", line 2881, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/", line 3228, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/", line 728, in _load_state_dict_into_meta_model
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/utils/", line 89, in set_module_quantized_tensor_to_device
    new_value = bnb.nn.Int8Params(new_value, requires_grad=False, **kwargs).to(device)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/bitsandbytes/nn/", line 294, in to
    return self.cuda(device)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/bitsandbytes/nn/", line 258, in cuda
    CB, CBt, SCB, SCBt, coo_tensorB = bnb.functional.double_quant(B)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/bitsandbytes/", line 1987, in double_quant
    row_stats, col_stats, nnz_row_ptr = get_colrow_absmax(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/bitsandbytes/", line 1876, in get_colrow_absmax
    lib.cget_col_row_stats(ptrA, ptrRowStats, ptrColStats, ptrNnzrows, ct.c_float(threshold), rows, cols)
  File "/usr/lib/python3.8/ctypes/", line 386, in __getattr__
    func = self.__getitem__(name)
  File "/usr/lib/python3.8/ctypes/", line 391, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /home/ubuntu/.local/lib/python3.8/site-packages/bitsandbytes/ undefined symbol: cget_col_row_stats
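For what it's worth, the final AttributeError is just ctypes reporting a missing export: the CPU-only bitsandbytes binary simply does not contain the GPU symbol `cget_col_row_stats`. A minimal reproduction of that failure mode (assuming a glibc Linux host; `libc.so.6` stands in for the bitsandbytes library here):

```python
import ctypes

# Load any shared library; glibc is a convenient stand-in.
libc = ctypes.CDLL("libc.so.6")

# Asking ctypes for a symbol the library does not export raises
# AttributeError -- the same error class as in the traceback above.
try:
    libc.cget_col_row_stats  # not a libc symbol; mirrors the missing GPU export
except AttributeError as exc:
    print("undefined symbol:", exc)
```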

The script is:

# Runs Falcon-40B Instruct in 8-bit mode, which should take ~45GB of RAM

from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

model_id = "tiiuae/falcon-40b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,        # 8-bit quantization via bitsandbytes
    trust_remote_code=True,   # Falcon ships its own modeling code
    device_map="auto",        # place layers on the available GPU(s)
)

print(f'Loaded {model_id}')

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

prompt = "Write a poem about Valencia."

print(f'Prompt: {prompt}\n')

sequences = pipeline(
    prompt,
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

The issue is with bitsandbytes: it wasn't able to locate the CUDA runtime library. One solution is to run the following script, which finds the library and adds its directory to LD_LIBRARY_PATH in .bashrc:


# Find the CUDA runtime library (the logs show bitsandbytes failing to
# resolve cudaRuntimeGetVersion, which lives in libcudart)
FILE_LOCATION=$(find / -name 'libcudart.so*' 2>/dev/null | head -n 1)

# If the file was found, add its directory to the LD_LIBRARY_PATH
if [ -n "$FILE_LOCATION" ]; then
  LIB_PATH=$(dirname "$FILE_LOCATION")
  echo "Found path: $LIB_PATH"

  # Check if the path is already in .bashrc
  if ! grep -q "LD_LIBRARY_PATH=.*$LIB_PATH" ~/.bashrc; then
    echo "Updating .bashrc with the found path..."
    echo "export LD_LIBRARY_PATH=\$LD_LIBRARY_PATH:$LIB_PATH" >> ~/.bashrc
    echo ".bashrc updated. Please restart your terminal or run 'source ~/.bashrc'"
  else
    echo "The path is already in .bashrc"
  fi
else
  echo "File not found."
fi
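Under the hood, bitsandbytes' CUDA setup scans the entries of LD_LIBRARY_PATH looking for the runtime, which is why exporting the path fixes the warnings. A minimal sketch of that search (a simplified stand-in for illustration, not the real implementation; the function name is made up):

```python
import os
from pathlib import Path

def find_in_ld_library_path(pattern: str) -> list:
    """Search each LD_LIBRARY_PATH entry for files matching `pattern`
    (a rough sketch of what bitsandbytes' CUDA setup does at import)."""
    hits = []
    for entry in os.environ.get("LD_LIBRARY_PATH", "").split(":"):
        if entry and Path(entry).is_dir():
            hits.extend(str(f) for f in Path(entry).glob(pattern))
    return hits

# With no usable entries on the path, the search comes back empty --
# which is the state the CUDA_SETUP warnings above describe.
os.environ["LD_LIBRARY_PATH"] = "/nonexistent/dir"
print(find_in_ld_library_path("libcudart.so*"))  # → []
```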

I forgot to add that Falcon does not currently run on a LambdaLabs H100 with this setup, but it worked for me on an A6000.

Hi @Gadersd!

Can you send the error you are getting on the H100 instance?


It was a cuBLAS error. See "cuBLAS API failed with status 15" · Issue #174 · tloen/alpaca-lora on GitHub.
It only occurs for me on H100.

@Gadersd, please start a support ticket and we can look more into it.