I have been a customer for 3 years, mostly corresponding with @jeremy for tech support. I have a QUAD Titan V and Ubuntu 20.04.
I wrote to support 3 days ago but haven’t heard back. After the latest update, the whole ML stack seems to have been wiped out and the GPUs are inaccessible. Here’s the output of ‘nvidia-smi’:
ivogeorg@turnaround:~$ nvidia-smi
Command 'nvidia-smi' not found, but can be installed with:
sudo apt install nvidia-340 # version 340.108-0ubuntu5.20.04.2, or
sudo apt install nvidia-utils-390 # version 390.143-0ubuntu0.20.04.1
sudo apt install nvidia-utils-450-server # version 450.119.04-0ubuntu0.20.04.2
sudo apt install nvidia-utils-460 # version 460.80-0ubuntu0.20.04.2
sudo apt install nvidia-utils-465 # version 465.27-0ubuntu0.20.04.2
sudo apt install nvidia-utils-435 # version 435.21-0ubuntu7
sudo apt install nvidia-utils-440 # version 440.82+really.440.64-0ubuntu6
sudo apt install nvidia-utils-418-server # version 418.197.02-0ubuntu0.20.04.1
sudo apt install nvidia-utils-460-server # version 460.73.01-0ubuntu0.20.04.1
Even NVIDIA X Server Settings shows just a blank box. I do remember running ‘sudo apt autoremove’ after the update, and it looks like that removed the stack packages as well. Btw, the updater has been acting strangely in general (e.g. it would report that packages needed to be updated or removed, but the list would be blank or wouldn’t change on OK). That might be related, and I might have inadvertently removed stack packages.
Why would that happen? What config has been lost or overridden? What should I do to restore the system?
If it’s inevitable, I will reinstall Ubuntu, but I have had difficulties restoring my disk setup after reinstall so I’d rather avoid it.
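One way to see what the autoremove actually took out is apt's own history log. The sketch below assumes the usual Ubuntu location `/var/log/apt/history.log` and its `Commandline:` / `Remove:` line layout. The function name `autoremoved_packages` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads an apt history log on stdin and prints the
# Remove:/Purge: lines of every transaction whose command line mentions
# "autoremove". Assumes the standard apt history.log layout.
autoremoved_packages() {
    awk '
        /^Commandline:/ { hit = ($0 ~ /autoremove/) }   # new transaction starts
        hit && /^(Remove|Purge):/ { print }             # packages it removed
    '
}
```

It could be run as `autoremoved_packages < /var/log/apt/history.log`. If driver or CUDA packages show up in the output, that would support the guess that the autoremove took the stack with it. Older entries may sit in rotated, compressed copies of the log.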
I do not see a ticket from you in the last 3 months. Emailing ‘support@lambdalabs.com’ should create a ticket.
And Jeremy is still here and active.
It looks like the NVIDIA software was removed, since the ‘nvidia-smi’ command is not even found. You can email me or ‘support@lambdalabs.com’ and we can walk you through this, or do it for you, and make sure it is working.
The following will remove all NVIDIA software and deep learning libraries, and then reinstall Lambda Stack.
Thanks for the reply. The commands did nothing for me. It looks like the stack was re-installed, but nvidia-smi is not found:
ivogeorg@turnaround:~$ sudo rm -f /etc/apt/sources.list.d/{graphics,nvidia,cuda}*; \
> COLUMNS=200 dpkg -l |
> awk '/cuda|lib(accinj64|cu(blas|dart|dnn|fft|inj|pti|rand|solver|sparse)|magma|nccl|npp|nv[^p])|nv(idia
> |ml)|tensor(flow|board)|torch/ { print $2 }' |
> sudo xargs -or apt -y remove --purge && LAMBDA_REPO=$(mktemp) && \
> wget -O${LAMBDA_REPO} https://lambdalabs.com/static/misc/lambda-stack-repo.deb && \
> sudo dpkg -i ${LAMBDA_REPO} && rm -f ${LAMBDA_REPO} && \
> sudo apt-get -y update && sudo apt-get -y install lambda-stack-cuda
[sudo] password for ivogeorg:
awk: line 1: runaway regular expression /cuda|lib(a ...
--2021-07-05 19:03:30-- https://lambdalabs.com/static/misc/lambda-stack-repo.deb
Resolving lambdalabs.com (lambdalabs.com)... 13.56.92.69, 52.8.229.234
Connecting to lambdalabs.com (lambdalabs.com)|13.56.92.69|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3588 (3.5K) [application/octet-stream]
Saving to: ‘/tmp/tmp.ojKwsSOJIv’
/tmp/tmp.ojKwsSOJIv 100%[===============================================>] 3.50K --.-KB/s in 0s
2021-07-05 19:03:30 (387 MB/s) - ‘/tmp/tmp.ojKwsSOJIv’ saved [3588/3588]
(Reading database ... 359550 files and directories currently installed.)
Preparing to unpack /tmp/tmp.ojKwsSOJIv ...
Unpacking lambda-repository (0.1) over (0.1) ...
Setting up lambda-repository (0.1) ...
Hit:1 http://dl.google.com/linux/chrome/deb stable InRelease
Get:2 https://download.docker.com/linux/ubuntu focal InRelease [52.1 kB]
Hit:3 http://archive.lambdalabs.com/ubuntu focal InRelease
Hit:4 http://packages.osrfoundation.org/gazebo/ubuntu-stable focal InRelease
Hit:5 https://packages.microsoft.com/repos/ms-teams stable InRelease
Hit:6 http://archive.ubuntu.com/ubuntu focal InRelease
Get:7 http://archive.ubuntu.com/ubuntu focal-updates InRelease [114 kB]
Get:8 http://archive.ubuntu.com/ubuntu focal-backports InRelease [101 kB]
Get:9 http://archive.ubuntu.com/ubuntu focal-security InRelease [114 kB]
Fetched 380 kB in 1s (303 kB/s)
Reading package lists... Done
Reading package lists... Done
Building dependency tree
Reading state information... Done
lambda-stack-cuda is already the newest version (0.1.12~20.04.3).
The following packages were automatically installed and are no longer required:
linux-headers-5.8.0-53-generic linux-hwe-5.8-headers-5.8.0-53 linux-image-5.8.0-53-generic
linux-modules-5.8.0-53-generic linux-modules-extra-5.8.0-53-generic
Use 'sudo apt autoremove' to remove them.
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
ivogeorg@turnaround:~$ nvidia-smi
Command 'nvidia-smi' not found, but can be installed with:
sudo apt install nvidia-340 # version 340.108-0ubuntu5.20.04.2, or
sudo apt install nvidia-utils-390 # version 390.143-0ubuntu0.20.04.1
sudo apt install nvidia-utils-450-server # version 450.119.04-0ubuntu0.20.04.2
sudo apt install nvidia-utils-460 # version 460.80-0ubuntu0.20.04.2
sudo apt install nvidia-utils-465 # version 465.27-0ubuntu0.20.04.2
sudo apt install nvidia-utils-435 # version 435.21-0ubuntu7
sudo apt install nvidia-utils-440 # version 440.82+really.440.64-0ubuntu6
sudo apt install nvidia-utils-418-server # version 418.197.02-0ubuntu0.20.04.1
sudo apt install nvidia-utils-460-server # version 460.73.01-0ubuntu0.20.04.1
It looks like I have unwittingly run the suggested sudo apt autoremove and removed the stack. But why would the software installer even suggest that?! Before that update, everything worked fine. I was trying to build Docker images with GPU support and other software which was clashing with the stack. I did install the nvidia container “stack” (which was more like a collage rather than a stack). I wonder if that superseded some of the stack settings and marked it for autoremove.
I worked with Jeremy mostly without creating new tickets. I think our correspondence was mostly under the same original ticket, which had long become outdated.
I guess writing here does not really open a ticket. So, does writing to ‘support@lambdalabs.com’ constitute a ticket?
Finally, you mention that I should email you or support: what is your email address?
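A note on the `awk: line 1: runaway regular expression` message in the transcript above: awk does not allow a regex literal to span a newline, and the pasted pattern is split between `nv(idia` and `|ml)`. With awk failing, `xargs -r` would have received no package names, so nothing was purged and the install step found `lambda-stack-cuda` already present. The sketch below puts the same pattern on a single line and only lists matches, so it is safe to try first. The function name `list_stack_packages` is hypothetical.

```shell
#!/bin/sh
# Hypothetical dry-run helper: reads `dpkg -l` output on stdin and prints the
# names of packages the purge step would match. Nothing is removed.
# The pattern is the one from the thread, joined onto a single line.
list_stack_packages() {
    awk '/cuda|lib(accinj64|cu(blas|dart|dnn|fft|inj|pti|rand|solver|sparse)|magma|nccl|npp|nv[^p])|nv(idia|ml)|tensor(flow|board)|torch/ { print $2 }'
}
```

Usage would be `COLUMNS=200 dpkg -l | list_stack_packages`. Reviewing that list before piping anything into `apt remove --purge` seems prudent.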
wget -nv -O- https://lambdalabs.com/install-lambda-stack.sh | sh -
and
sudo apt-get remove lambda-stack-cuda && sudo apt-get install lambda-stack-cuda
(The same steps are shown in this YouTube video: https://www.youtube.com/watch?v=sEUOa0s-RQY)
Thanks for the reply.
On my LambdaBlade with Ubuntu 20.04 and 4x A6000, I wanted to upgrade the CUDA stack.
So I issued this command: $ sudo apt-get remove lambda-stack-cuda && sudo apt-get install lambda-stack-cuda
It successfully fetched 115 packages, extracted templates, and read the database of files and directories, then started Removing grub-theme-lambda-text (0.1) …
and got stuck there for more than an hour.
When I tried Ctrl+C to kill the command, it wouldn’t respond.
I tried rebooting the system, and it did not respond either.
While rebooting, it got stuck with the following message: Syncing Filesystems and Block devices - timed out, issuing SIGKILL to PID nnnnn
Should I raise a ticket? Or is there a simple procedure to restart the reinstall process?
When I rebooted again, the system came back, and I am able to connect.
How do I check whether sudo apt-get install lambda-stack-cuda is complete?
When I run sudo apt-cache policy lambda-stack-cuda I get
Also, when I checked the tensorflow and torch versions, I see the following:
Does that mean everything is installed?
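For the "is it complete?" question, one hedged approach is to ask dpkg for the package status rather than infer it from library versions. The sketch assumes that a fully installed package reports `Status: install ok installed` in `dpkg -s` output. The function name `is_fully_installed` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical check: reads `dpkg -s PACKAGE` output on stdin and succeeds
# only if dpkg reports the package as completely installed and configured.
is_fully_installed() {
    grep -q '^Status: install ok installed$'
}
```

For example: `dpkg -s lambda-stack-cuda | is_fully_installed && echo complete`. Running `dpkg --audit` as well is a reasonable cross-check, since it lists any packages left half-installed or unconfigured and prints nothing when there are none.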
Another question: the CUDA version is currently 11.7 per nvidia-smi, and I want to upgrade to CUDA 12.1.
What is the easiest way to upgrade it?
FYI, I have already created a ticket.
The problem is solved.
The sudo apt-get remove lambda-stack-cuda && sudo apt-get install lambda-stack-cuda run
had failed midway because of the interruption, so I had to run sudo dpkg --configure -a and then reissue the above install of lambda-stack-cuda.
This went through. I rebooted the system and everything is up to date.
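The recovery described above can be sketched as a small script. The two commands are the ones named in the post. The wrapper function `recover_stack` and the `DRY_RUN` switch are assumptions added for illustration, so the sequence can be previewed before it touches the system.

```shell
#!/bin/sh
# Hypothetical wrapper around the two recovery commands from the thread.
# With DRY_RUN=1 the commands are printed instead of executed.
recover_stack() {
    run() { if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi; }
    # Finish configuring packages left half-installed by the interrupted run,
    # then reissue the install only if that step succeeded.
    run sudo dpkg --configure -a &&
    run sudo apt-get install lambda-stack-cuda
}
```

`DRY_RUN=1 recover_stack` prints the two commands, and plain `recover_stack` runs them. A reboot afterwards, as in the post, lets the reinstalled driver load.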