I tried to install NVIDIA’s Transformer Engine library on an H100 instance, following the documentation:
pip install --upgrade git+https://github.com/NVIDIA/TransformerEngine.git@stable
but the installation failed with a ModuleNotFoundError.
PyTorch is installed, and import torch works from the interpreter. What could be the issue?
Collecting git+https://github.com/NVIDIA/TransformerEngine.git@main
Cloning https://github.com/NVIDIA/TransformerEngine.git (to revision main) to /tmp/pip-req-build-jre6s3f9
Running command git clone --filter=blob:none --quiet https://github.com/NVIDIA/TransformerEngine.git /tmp/pip-req-build-jre6s3f9
Resolved https://github.com/NVIDIA/TransformerEngine.git to commit 144e4888b2cdd60bd52e706d5b7a79cb9c1a7156
Running command git submodule update --init --recursive -q
Preparing metadata (setup.py) ... done
Collecting flash-attn==1.0.6
Using cached flash_attn-1.0.6.tar.gz (2.0 MB)
Installing build dependencies ... done
Getting requirements to build wheel ... error
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [15 lines of output]
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.8/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
main()
File "/home/ubuntu/.local/lib/python3.8/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
File "/home/ubuntu/.local/lib/python3.8/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
return hook(config_settings)
File "/tmp/pip-build-env-hb2bg3j0/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 341, in get_requires_for_build_wheel
return self._get_build_requires(config_settings, requirements=['wheel'])
File "/tmp/pip-build-env-hb2bg3j0/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 323, in _get_build_requires
self.run_setup()
File "/tmp/pip-build-env-hb2bg3j0/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 338, in run_setup
exec(code, locals())
File "<string>", line 13, in <module>
ModuleNotFoundError: No module named 'torch'
[end of output]
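To double-check that torch really is importable, a minimal sketch like the one below prints which interpreter is running and whether it can locate the module. The helper name module_visible is only illustrative. Note that the traceback above ran from a temporary /tmp/pip-build-env-... directory rather than my regular site-packages, so a positive result here only describes the interpreter the script is run with.

```python
import importlib.util
import sys


def module_visible(name):
    """Return True if the running interpreter can locate the module `name`."""
    # find_spec returns None when a top-level module cannot be found.
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Show which Python binary is executing this script.
    print("interpreter:", sys.executable)
    # Report whether torch is on this interpreter's import path.
    print("torch visible:", module_visible("torch"))
```

Running it with the same Python that backs pip (for example via python -m pip) would show whether the two agree.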
Edit: I also get an error on an A100 instance:
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/cuda/bin/nvcc'
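For the A100 case, a quick way to see whether nvcc exists where the build looked for it, or anywhere on PATH, could look like this. It is only a diagnostic sketch: the function name find_nvcc is hypothetical, and the default path is simply the one taken from the error message.

```python
import os
import shutil


def find_nvcc(default_path="/usr/local/cuda/bin/nvcc"):
    """Return a path to nvcc, or None if it is in neither location."""
    # First check the exact location named in the FileNotFoundError.
    if os.path.isfile(default_path):
        return default_path
    # Otherwise fall back to searching the directories on PATH.
    return shutil.which("nvcc")


if __name__ == "__main__":
    location = find_nvcc()
    if location is None:
        print("nvcc not found at the default path or on PATH")
    else:
        print("nvcc found at:", location)
```

If this reports that nvcc is missing, the instance probably has only the driver and runtime libraries installed rather than the full CUDA toolkit, though I have not confirmed that.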