Lambda Stack seems to be wiped out after Ubuntu update

Hello,

I have been a customer for 3 years, mostly corresponding with @jeremy for tech support. I have a QUAD Titan V and Ubuntu 20.04.

I wrote 3 days ago to support but haven’t heard back. After the latest update the whole ML stack seems to be wiped out and the GPUs are inaccessible. Here’s the output of ‘nvidia-smi’:

ivogeorg@turnaround:~$ nvidia-smi

Command 'nvidia-smi' not found, but can be installed with:

sudo apt install nvidia-340               # version 340.108-0ubuntu5.20.04.2, or
sudo apt install nvidia-utils-390         # version 390.143-0ubuntu0.20.04.1
sudo apt install nvidia-utils-450-server  # version 450.119.04-0ubuntu0.20.04.2
sudo apt install nvidia-utils-460         # version 460.80-0ubuntu0.20.04.2
sudo apt install nvidia-utils-465         # version 465.27-0ubuntu0.20.04.2
sudo apt install nvidia-utils-435         # version 435.21-0ubuntu7
sudo apt install nvidia-utils-440         # version 440.82+really.440.64-0ubuntu6
sudo apt install nvidia-utils-418-server  # version 418.197.02-0ubuntu0.20.04.1
sudo apt install nvidia-utils-460-server  # version 460.73.01-0ubuntu0.20.04.1

Even the NVIDIA X Server Settings is just a blank box. I do remember using ‘sudo apt autoremove’ after the update, but that looked like it removed the stack packages as well. Btw, the update has been acting strangely in general (e.g. it would report packages that needed to be updated or removed, but the list would be blank or wouldn’t change after clicking OK). It might be related and I might have inadvertently removed stack packages.

Why would that happen? What config has been lost or overridden? What should I do to restore the system?

If it’s inevitable, I will reinstall Ubuntu, but I have had difficulties restoring my disk setup after reinstall so I’d rather avoid it.

Thanks in advance!

–Ivo

Ivo,

I do not see a ticket from you in the last 3 months. Emailing ‘support@lambdalabs.com’ should create a ticket.
And Jeremy is still here and active.

Since the ‘nvidia-smi’ command is not even found, it looks like the NVIDIA software was removed
and is no longer installed. You can email me or ‘support@lambdalabs.com’ and we can walk
through this with you, or do it for you, and make sure it is working.

The following will remove all NVIDIA software and deep learning libraries, and then reinstall Lambda Stack.

sudo rm -f /etc/apt/sources.list.d/{graphics,nvidia,cuda}*; \
	COLUMNS=200 dpkg -l |
	awk '/cuda|lib(accinj64|cu(blas|dart|dnn|fft|inj|pti|rand|solver|sparse)|magma|nccl|npp|nv[^p])|nv(idia|ml)|tensor(flow|board)|torch/ { print $2 }' |
	sudo xargs -or apt -y remove --purge && LAMBDA_REPO=$(mktemp) && \
	wget -O${LAMBDA_REPO} https://lambdalabs.com/static/misc/lambda-stack-repo.deb && \
	sudo dpkg -i ${LAMBDA_REPO} && rm -f ${LAMBDA_REPO} && \
	sudo apt-get -y update && sudo apt-get -y install lambda-stack-cuda

Try running that, then reboot, and let me know if you’re able to run nvidia-smi.
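
Before purging anything, it can help to preview which packages the awk filter would select. The following is only a sketch: `list_stack_packages` is a hypothetical helper name, and it assumes the pattern is kept on a single line (a line break inside the regular expression makes awk fail). The pattern is tested against the whole `dpkg -l` line, so a package whose description mentions one of the keywords is selected as well.

```shell
# Hypothetical helper: read `dpkg -l` style output on stdin and print the
# names (second column) of the packages the purge pipeline would select.
# Nothing is removed; this only filters text.
list_stack_packages() {
    awk '/cuda|lib(accinj64|cu(blas|dart|dnn|fft|inj|pti|rand|solver|sparse)|magma|nccl|npp|nv[^p])|nv(idia|ml)|tensor(flow|board)|torch/ { print $2 }'
}

# Preview on a real system (read-only):
#   COLUMNS=200 dpkg -l | list_stack_packages
```

If the preview prints nothing, the purge step has nothing to remove, and only the reinstall part of the command does any work.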

Mark

Hi Mark,

Thanks for the reply. The commands did nothing for me. It looks like the stack was re-installed, but nvidia-smi is not found:

ivogeorg@turnaround:~$ sudo rm -f /etc/apt/sources.list.d/{graphics,nvidia,cuda}*; \
> COLUMNS=200 dpkg -l |
> awk '/cuda|lib(accinj64|cu(blas|dart|dnn|fft|inj|pti|rand|solver|sparse)|magma|nccl|npp|nv[^p])|nv(idia
> |ml)|tensor(flow|board)|torch/ { print $2 }' |
> sudo xargs -or apt -y remove --purge && LAMBDA_REPO=$(mktemp) && \
> wget -O${LAMBDA_REPO} https://lambdalabs.com/static/misc/lambda-stack-repo.deb && \
> sudo dpkg -i ${LAMBDA_REPO} && rm -f ${LAMBDA_REPO} && \
> sudo apt-get -y update && sudo apt-get -y install lambda-stack-cuda
[sudo] password for ivogeorg: 
awk: line 1: runaway regular expression /cuda|lib(a ...
--2021-07-05 19:03:30--  https://lambdalabs.com/static/misc/lambda-stack-repo.deb
Resolving lambdalabs.com (lambdalabs.com)... 13.56.92.69, 52.8.229.234
Connecting to lambdalabs.com (lambdalabs.com)|13.56.92.69|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3588 (3.5K) [application/octet-stream]
Saving to: ‘/tmp/tmp.ojKwsSOJIv’

/tmp/tmp.ojKwsSOJIv           100%[===============================================>]   3.50K  --.-KB/s    in 0s      

2021-07-05 19:03:30 (387 MB/s) - ‘/tmp/tmp.ojKwsSOJIv’ saved [3588/3588]

(Reading database ... 359550 files and directories currently installed.)
Preparing to unpack /tmp/tmp.ojKwsSOJIv ...
Unpacking lambda-repository (0.1) over (0.1) ...
Setting up lambda-repository (0.1) ...
Hit:1 http://dl.google.com/linux/chrome/deb stable InRelease
Get:2 https://download.docker.com/linux/ubuntu focal InRelease [52.1 kB]                                             
Hit:3 http://archive.lambdalabs.com/ubuntu focal InRelease                                                           
Hit:4 http://packages.osrfoundation.org/gazebo/ubuntu-stable focal InRelease                                         
Hit:5 https://packages.microsoft.com/repos/ms-teams stable InRelease                               
Hit:6 http://archive.ubuntu.com/ubuntu focal InRelease                       
Get:7 http://archive.ubuntu.com/ubuntu focal-updates InRelease [114 kB]
Get:8 http://archive.ubuntu.com/ubuntu focal-backports InRelease [101 kB]
Get:9 http://archive.ubuntu.com/ubuntu focal-security InRelease [114 kB]
Fetched 380 kB in 1s (303 kB/s)    
Reading package lists... Done
Reading package lists... Done
Building dependency tree       
Reading state information... Done
lambda-stack-cuda is already the newest version (0.1.12~20.04.3).
The following packages were automatically installed and are no longer required:
  linux-headers-5.8.0-53-generic linux-hwe-5.8-headers-5.8.0-53 linux-image-5.8.0-53-generic
  linux-modules-5.8.0-53-generic linux-modules-extra-5.8.0-53-generic
Use 'sudo apt autoremove' to remove them.
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
ivogeorg@turnaround:~$ nvidia-smi

Command 'nvidia-smi' not found, but can be installed with:

sudo apt install nvidia-340               # version 340.108-0ubuntu5.20.04.2, or
sudo apt install nvidia-utils-390         # version 390.143-0ubuntu0.20.04.1
sudo apt install nvidia-utils-450-server  # version 450.119.04-0ubuntu0.20.04.2
sudo apt install nvidia-utils-460         # version 460.80-0ubuntu0.20.04.2
sudo apt install nvidia-utils-465         # version 465.27-0ubuntu0.20.04.2
sudo apt install nvidia-utils-435         # version 435.21-0ubuntu7
sudo apt install nvidia-utils-440         # version 440.82+really.440.64-0ubuntu6
sudo apt install nvidia-utils-418-server  # version 418.197.02-0ubuntu0.20.04.1
sudo apt install nvidia-utils-460-server  # version 460.73.01-0ubuntu0.20.04.1

It looks like I have unwittingly run the suggested sudo apt autoremove and removed the stack. But why would the software installer even suggest that?! Before that update, everything worked fine. I was trying to build Docker images with GPU support and other software which was clashing with the stack. I did install the nvidia container “stack” (which was more like a collage rather than a stack). I wonder if that superseded some of the stack settings and marked it for autoremove.
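
One way to see what ‘sudo apt autoremove’ would take with it before confirming is to run it in simulation mode and filter the output. This is a minimal sketch, assuming the simulated output lists each removal on a line starting with `Remv`; the helper name `would_remove_stack` and the name pattern are assumptions for illustration, not part of Lambda Stack.

```shell
# Hypothetical helper: read `apt-get -s autoremove` output on stdin and print
# the packages scheduled for removal whose names look like driver or
# Lambda Stack packages. The name pattern is an assumption; adjust as needed.
would_remove_stack() {
    awk '$1 == "Remv" && $2 ~ /nvidia|cuda|lambda/ { print $2 }'
}

# Simulation only, nothing is removed:
#   apt-get -s autoremove | would_remove_stack
```

Any package printed there could be protected from autoremove by marking it as manually installed with `sudo apt-mark manual <package>`.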

I worked with Jeremy mostly without creating new tickets. I think our correspondence was mostly under the same original ticket, which had long become outdated.

I guess writing here does not really open a ticket. So, does writing to ‘support@lambdalabs.com’ constitute a ticket?

Finally, you mention that I should email you or support: what is your email address?

Thanks in advance,

Ivo

Ivo! Hello again!

Interesting to see that nvidia-smi isn’t being installed.
It looks like it’s part of the nvidia-utils-460 package, so let’s try installing that manually:

sudo apt install nvidia-utils-460

Maybe that will tell us why it didn’t install with lambda-stack-cuda.

@ivogeorg this should solve your problem, it did for me at least:

sudo apt-get remove lambda-stack-cuda && sudo apt-get install lambda-stack-cuda

This worked for me on Ubuntu 20.10.

Just want to clarify.
System: LambdaBlade with Ubuntu 20.04.
Is there a difference between the commands under “Getting started: Removing and installing Lambda Stack in Ubuntu”
[Uninstall (purge) the existing Lambda Stack by running:

sudo rm -f /etc/apt/sources.list.d/{graphics,nvidia,cuda}* && \
dpkg -l | \
awk '/cuda|lib(accinj64|cu(blas|dart|dnn|fft|inj|pti|rand|solver|sparse)|magma|nccl|npp|nv[^p])|nv(idia|ml)|tensor(flow|board)|torch/ { print $2 }' | \
sudo xargs -or apt -y remove --purge

Then, install the latest Lambda Stack by running:

wget -nv -O- https://lambdalabs.com/install-lambda-stack.sh | sh -

]
and
sudo apt-get remove lambda-stack-cuda && sudo apt-get install lambda-stack-cuda
(the same command is shown in this YouTube video: https://www.youtube.com/watch?v=sEUOa0s-RQY)?

The removing and reinstalling Lambda Stack instructions additionally remove repositories and packages that might conflict with Lambda Stack.

Thanks for the reply.
On my LambdaBlade with Ubuntu 20.04 and 4x A6000, I wanted to upgrade the CUDA stack.
So I issued this command:
$ sudo apt-get remove lambda-stack-cuda && sudo apt-get install lambda-stack-cuda
After it successfully fetched 115 packages, extracted templates, and read the database of files and directories, it started
Removing grub-theme-lambda-text (0.1) …
and got stuck for more than an hour.
When I tried Ctrl+C to kill the command, it wouldn’t respond.
I tried rebooting the system, and it is not responding.
While rebooting, it got stuck with the following messages: Syncing Filesystems and Block devices – timed out, issuing SIGKILL to PID nnnnn
Should I raise a ticket? Or is there a simple procedure to restart the reinstall process?

I suggest submitting a support ticket.

Or, if you don’t have data that you need to preserve, you can use a recovery image to reinstall Ubuntu and Lambda Stack.

Hi Cody,

When I rebooted again, the system came back, and I am able to connect.
How can I check whether
sudo apt-get install lambda-stack-cuda is complete?
When I run
sudo apt-cache policy lambda-stack-cuda I get

Also when I checked tensorflow and torch versions I see the following:
image

Does it mean everything is installed?

Another question: the CUDA version is currently 11.7 per nvidia-smi, and I want to upgrade to CUDA 12.1.
What is the easiest way to upgrade it?
FYI, I have already created a ticket.

The problem is solved.
Running
sudo apt-get remove lambda-stack-cuda && sudo apt-get install lambda-stack-cuda
failed partway through because of the interruption, so I had to run
sudo dpkg --configure -a and then reissue the install of lambda-stack-cuda.
That went through, I rebooted the system, and everything is now up to date.
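
For anyone who lands in the same situation, an interrupted install usually leaves some packages unpacked or half-configured, and these can be spotted in the state column of `dpkg -l`. The following is a rough sketch, not an official Lambda procedure; `incomplete_packages` is a hypothetical helper name. `dpkg --audit` reports similar information directly.

```shell
# Hypothetical helper: read `dpkg -l` output on stdin and print the names of
# packages whose status flag shows an unfinished install:
#   U = unpacked, F = half-configured, H = half-installed,
#   W = triggers-awaiting, t = triggers-pending.
# Header lines do not match the two-letter state pattern and are skipped.
incomplete_packages() {
    awk '$1 ~ /^[uirph][UFHWt]R?$/ { print $2 }'
}

# Check on a real system (read-only):
#   dpkg -l | incomplete_packages
```

An empty result suggests there is nothing left for `sudo dpkg --configure -a` to finish.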