Too many open files

I only seem to get this while using an A100 with an attached filesystem. I have not done anything unusual that I know of, and ulimit -n shows 1048576.

I have not run into this with any other instances before; the A6000 I was previously using with the same filesystem never ran into this problem.

Any ideas what might cause this?

The next time this happens, can you capture the output of:

# print [num-handles] [PID] for the top 10 processes
lsof -n | awk '{print $2}' | sort | uniq -c | sort -rn | head -n 10

and then the names of the processes attached to each PID (the second column above) with:

ps aux | grep -i <PID>

Once you’ve found your likely culprit, you can run lsof -p <PID> and we’ll be able to see exactly what these files are.
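If it helps, here’s a small helper sketch that rolls those two steps into one (the output format is just an illustration): for each of the ten PIDs with the most lsof entries, it prints the count, the PID, and the process name.

# for the ten PIDs with the most lsof entries, print count, PID, and process name
lsof -n 2>/dev/null | awk 'NR > 1 {print $2}' | sort | uniq -c | sort -rn | head -n 10 |
while read -r count pid; do
  printf '%8s handles  PID %-8s %s\n' "$count" "$pid" "$(ps -p "$pid" -o comm= 2>/dev/null)"
done
# then inspect the likely culprit in detail:
# lsof -p <PID>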


Hi, did you find any solution for this? I am running into the same error: the instance fails during model training, every time just before finishing the first epoch, with this exact “Too many open files” error.

I am using the Lambda Labs Persistent Storage beta as well.

Hello!
@pathos00011, @yapee23
Can you send us an example of the command you run that causes this to happen, so we can reproduce this?

Best,
Yanos

I’ve run into this same error. Linking my current thread on it:

Very easy to reproduce. Create a Lambda Cloud file system, attach it to an instance, and put 2M dummy files into it (that alone may be enough to trigger the error). Then run
aws s3 sync ./local_file_dir/ s3://some-s3-file-bucket-test/
You’ll run into it before the folder is finished syncing.
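For reference, a rough sketch of that reproduction (the directory and bucket names are the same placeholders as above):

mkdir -p ./local_file_dir && cd ./local_file_dir
# create 2M empty dummy files in batches; this step alone may already trigger the error
seq -f "dummy_%07g" 1 2000000 | xargs -n 1000 touch
cd ..
# the sync then fails with "Too many open files" before it completes
aws s3 sync ./local_file_dir/ s3://some-s3-file-bucket-test/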

I’m convinced this is a Lambda Cloud file system error: aws s3 sync is engineered to scale, and I have seen the same error in other code contexts multiple times.

Additionally, lsof seems unable to diagnose these open file handles. For example, the highest number of open file handles comes from the jupyter-lab process when running lsof -n | awk '{print $2}' | sort | uniq -c | sort -rn | head -n 10, but I don’t think that is the issue, because (1) I’ve killed it and stopped it from respawning, which still doesn’t solve the problem, and (2) jupyter-lab shows a high number of open file handles even on a freshly restarted instance.
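As a cross-check on lsof, counting the entries under /proc/<pid>/fd counts only real file descriptors, whereas lsof’s per-PID totals also include things like memory-mapped files and cwd/txt entries that don’t count against the ulimit. A rough sketch (run with sudo to see other users’ processes):

# count open file descriptors per process straight from /proc, top 10
for pid in /proc/[0-9]*; do
  n=$(ls "$pid/fd" 2>/dev/null | wc -l)
  [ "$n" -gt 0 ] && printf '%6d  %-8s %s\n' "$n" "${pid#/proc/}" "$(cat "$pid/comm" 2>/dev/null)"
done | sort -rn | head -n 10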

The only thing that has worked for me so far is restarting the instance, but that makes Lambda Labs untenable for normal-scale machine learning work.

@mpapili @yanos

I typically resolve this type of issue by either:

Setting PAM limits by adding the following line to /etc/pam.d/common-session:

session required pam_limits.so

Or just running ulimit -n unlimited.
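For completeness, a sketch of what that setup might look like (the limit value is only an example; pam_limits.so reads its per-user limits from /etc/security/limits.conf):

# /etc/security/limits.conf -- per-user limits picked up by pam_limits.so
*    soft    nofile    1048576
*    hard    nofile    1048576

# /etc/pam.d/common-session -- make sure the module is enabled
session required pam_limits.so

# or, for the current shell only (capped at the hard limit):
ulimit -n 1048576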