Transformer Engine installation fails

I tried to install Nvidia's Transformer Engine library on an H100 instance, following the documentation:

pip install --upgrade git+https://github.com/NVIDIA/TransformerEngine.git@stable

but the installation failed with a ModuleNotFoundError.

PyTorch is installed, and import torch works from the interpreter. What could be the issue?

Collecting git+https://github.com/NVIDIA/TransformerEngine.git@main
  Cloning https://github.com/NVIDIA/TransformerEngine.git (to revision main) to /tmp/pip-req-build-jre6s3f9
  Running command git clone --filter=blob:none --quiet https://github.com/NVIDIA/TransformerEngine.git /tmp/pip-req-build-jre6s3f9
  Resolved https://github.com/NVIDIA/TransformerEngine.git to commit 144e4888b2cdd60bd52e706d5b7a79cb9c1a7156
  Running command git submodule update --init --recursive -q
  Preparing metadata (setup.py) ... done
Collecting flash-attn==1.0.6
  Using cached flash_attn-1.0.6.tar.gz (2.0 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... error
  error: subprocess-exited-with-error
  
  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [15 lines of output]
      Traceback (most recent call last):
        File "/home/ubuntu/.local/lib/python3.8/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
          main()
        File "/home/ubuntu/.local/lib/python3.8/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
        File "/home/ubuntu/.local/lib/python3.8/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
          return hook(config_settings)
        File "/tmp/pip-build-env-hb2bg3j0/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 341, in get_requires_for_build_wheel
          return self._get_build_requires(config_settings, requirements=['wheel'])
        File "/tmp/pip-build-env-hb2bg3j0/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 323, in _get_build_requires
          self.run_setup()
        File "/tmp/pip-build-env-hb2bg3j0/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 338, in run_setup
          exec(code, locals())
        File "<string>", line 13, in <module>
      ModuleNotFoundError: No module named 'torch'
      [end of output]

Edit: I also get an error on an A100 instance:

FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/cuda/bin/nvcc'
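
In case it helps: the checks below are just what I ran to see whether the CUDA toolkit was present at all. The paths and the version directory are examples from my instance, not something from the Transformer Engine docs.

$ which nvcc
$ ls /usr/local | grep cuda
$ echo $CUDA_HOME
$ # if the toolkit is installed somewhere else, pointing CUDA_HOME at it may help the build find nvcc, e.g.:
$ export CUDA_HOME=/usr/local/cuda-12.1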

Hi!

The easiest way to get Nvidia's Transformer Engine library is to use the NGC PyTorch Docker image.

Transformer Engine comes preinstalled in the PyTorch containers on NVIDIA GPU Cloud (NGC), versions 22.09 and later.
(ref. Installation — Transformer Engine 0.9.0 documentation)

First, you will need to create an account to be able to access the NVIDIA container registry (nvcr.io).
After you register:

Log in:

$ docker login nvcr.io
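
When prompted, the username is the literal string $oauthtoken and the password is your NGC API key (the value below is a placeholder, not a real key):

Username: $oauthtoken
Password: <paste your NGC API key here>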

Then pull the image:

$ docker pull nvcr.io/nvidia/pytorch:23.05-py3

Create a container and attach to it:

$ docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:23.05-py3
root@d3e03123808c:/workspace# pip -v list | grep transformer-engine
transformer-engine      0.8.0                           /usr/local/lib/python3.10/dist-packages pip
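
If you want to go one step further, a quick sanity check inside the container (just a smoke test, not part of the official instructions) is to confirm that the module actually imports and that the GPU is visible:

root@d3e03123808c:/workspace# # should print True and the te.Linear class
root@d3e03123808c:/workspace# python -c "import torch, transformer_engine.pytorch as te; print(torch.cuda.is_available(), te.Linear)"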

I hope this helps.

Best,
Yanos

This works, thanks! I just needed to add sudo to get the docker commands to run on cloud instances.
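
For anyone else hitting the permission error: either prefix the commands with sudo, or add your user to the docker group (this is standard Docker setup, not specific to NGC, and requires logging out and back in):

$ sudo docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:23.05-py3
$ # or, to drop the sudo prefix for future sessions:
$ sudo usermod -aG docker $USER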
