Transformer Engine installation fails

lzanini · June 2, 2023, 9:54pm

I tried to install NVIDIA’s Transformer Engine library on an H100 instance, following the documentation:

pip install --upgrade git+https://github.com/NVIDIA/TransformerEngine.git@stable

but the installation failed with a ModuleNotFoundError.

PyTorch is installed, and import torch works from the interpreter. What could be the issue?

Collecting git+https://github.com/NVIDIA/TransformerEngine.git@main
  Cloning https://github.com/NVIDIA/TransformerEngine.git (to revision main) to /tmp/pip-req-build-jre6s3f9
  Running command git clone --filter=blob:none --quiet https://github.com/NVIDIA/TransformerEngine.git /tmp/pip-req-build-jre6s3f9
  Resolved https://github.com/NVIDIA/TransformerEngine.git to commit 144e4888b2cdd60bd52e706d5b7a79cb9c1a7156
  Running command git submodule update --init --recursive -q
  Preparing metadata (setup.py) ... done
Collecting flash-attn==1.0.6
  Using cached flash_attn-1.0.6.tar.gz (2.0 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... error
  error: subprocess-exited-with-error
  
  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [15 lines of output]
      Traceback (most recent call last):
        File "/home/ubuntu/.local/lib/python3.8/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
          main()
        File "/home/ubuntu/.local/lib/python3.8/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
        File "/home/ubuntu/.local/lib/python3.8/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
          return hook(config_settings)
        File "/tmp/pip-build-env-hb2bg3j0/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 341, in get_requires_for_build_wheel
          return self._get_build_requires(config_settings, requirements=['wheel'])
        File "/tmp/pip-build-env-hb2bg3j0/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 323, in _get_build_requires
          self.run_setup()
        File "/tmp/pip-build-env-hb2bg3j0/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 338, in run_setup
          exec(code, locals())
        File "<string>", line 13, in <module>
      ModuleNotFoundError: No module named 'torch'
      [end of output]

Edit: I also get an error on an A100 instance:

FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/cuda/bin/nvcc'
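
Both errors look like problems with the build environment rather than with Transformer Engine itself. The /tmp/pip-build-env-hb2bg3j0 paths in the traceback suggest that pip is building flash-attn in an isolated environment, where the torch installed for the user is not visible, and the A100 error says the CUDA compiler is not at the expected path. A minimal sketch for checking both preconditions on the host follows. check_nvcc, check_torch, NVCC_PATH and PYTHON_BIN are illustrative names, not part of pip or any NVIDIA tool:

```shell
#!/usr/bin/env bash
# Sketch: check the two preconditions that the errors above complain about.
# NVCC_PATH and PYTHON_BIN are illustrative variables; override them as needed.

NVCC_PATH="${NVCC_PATH:-/usr/local/cuda/bin/nvcc}"
PYTHON_BIN="${PYTHON_BIN:-python3}"

# Prints "ok" if the given file exists and is executable, "missing" otherwise.
check_nvcc() {
    if [ -x "$1" ]; then echo "ok"; else echo "missing"; fi
}

# Prints "ok" if the given interpreter can import torch, "missing" otherwise.
check_torch() {
    if "$1" -c "import torch" >/dev/null 2>&1; then echo "ok"; else echo "missing"; fi
}

echo "nvcc at $NVCC_PATH: $(check_nvcc "$NVCC_PATH")"
echo "torch importable by $PYTHON_BIN: $(check_torch "$PYTHON_BIN")"
```

If torch reports ok here but the build still cannot find it, build isolation is the likely cause. pip's --no-build-isolation flag disables it, though whether that alone fixes this install is not confirmed in this thread.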

yanos · June 3, 2023, 1:32am

Hi!

The easiest way to get NVIDIA’s Transformer Engine library is to use the NGC PyTorch Docker image.

The Transformer Engine library is preinstalled in the PyTorch container in versions 22.09 and later on NVIDIA GPU Cloud.
(ref. Installation — Transformer Engine 0.9.0 documentation)

First, you will need to create an account to be able to access the NVIDIA Docker Registry.
After you register:

Log in:

$ docker login nvcr.io

Then pull the image:

$ docker pull nvcr.io/nvidia/pytorch:23.05-py3

Create a container and attach to it:

$ docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:23.05-py3

root@d3e03123808c:/workspace# pip -v list | grep transformer-engine
transformer-engine      0.8.0                           /usr/local/lib/python3.10/dist-packages pip
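
As a complementary check, here is a minimal sketch that asks Python whether the transformer_engine module can be found, without importing it. module_status is a hypothetical helper name, and the sketch assumes that python3 inside the container is the interpreter pip installed into:

```shell
#!/usr/bin/env bash
# Sketch: report whether a Python module can be located, without importing it.
# module_status is a hypothetical helper, not part of pip or Transformer Engine.

module_status() {
    local module="$1"
    local python_bin="${2:-python3}"
    # find_spec returns None for a missing top-level module, so exit 1 in that case.
    if "$python_bin" -c "import importlib.util, sys; sys.exit(0 if importlib.util.find_spec(sys.argv[1]) else 1)" "$module" >/dev/null 2>&1; then
        echo "found"
    else
        echo "not found"
    fi
}

echo "transformer_engine: $(module_status transformer_engine)"
```

Run inside the container, this should agree with the pip list output above. Run on the host, it shows whether the earlier pip install left anything behind.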

I hope this helps.

Best,
Yanos

lzanini · June 3, 2023, 5:02am

This works, thanks! I just needed to add sudo to get the docker commands to run on cloud instances.
