Too many open files

I only seem to get this while using an A100 with an attached filesystem. I have not done anything unusual that I know of, and ulimit -n shows 1048576.

I have not run into this with any other instances before; the A6000 I was previously using with the same filesystem never ran into this problem.

Any ideas what might cause this?

The next time this happens, can you capture the output of:

# print [num-handles] [PID] for the top 10 processes
lsof -n | awk '{print $2}' | sort | uniq -c | sort -rn | head -n 10

and then the names of the processes attached to each PID (the second column above) with:

ps aux | grep -i <PID>

Once you’ve found your likely culprit, you can run lsof -p <PID> and we’ll be able to see exactly what these files are.
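If it helps, here’s a small helper sketch that rolls those two steps into one (the output format is just an illustration): for each of the ten PIDs with the most lsof entries, it prints the count, the PID, and the process name.

# for the ten PIDs with the most lsof entries, print count, PID, and process name
lsof -n 2>/dev/null | awk 'NR > 1 {print $2}' | sort | uniq -c | sort -rn | head -n 10 |
while read -r count pid; do
  printf '%8s handles  PID %-8s %s\n' "$count" "$pid" "$(ps -p "$pid" -o comm= 2>/dev/null)"
done
# then inspect the likely culprit in detail:
# lsof -p <PID>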


Hi, did you find any solution for this? I am running into the same error: the instance fails during model training, every time just before finishing the first epoch, with this exact “Too many open files” error.

I am using the Lambda Labs Persistent Storage beta as well.

Hello!
@pathos00011, @yapee23
Can you send us an example of the command you run that causes this to happen, so we can reproduce this?

Best,
Yanos

I’ve run into this same error. Linking my current thread on it:

Very easy to reproduce. Create a Lambda Cloud file system, attach it to an instance, and put 2M dummy files into it (that alone may be enough to trigger the error). Then run
aws s3 sync ./local_file_dir/ s3://some-s3-file-bucket-test/
You’ll run into it before the folder is finished syncing.
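For reference, a rough sketch of that reproduction (the directory and bucket names are the same placeholders as above):

mkdir -p ./local_file_dir && cd ./local_file_dir
# create 2M empty dummy files in batches; this step alone may already trigger the error
seq -f "dummy_%07g" 1 2000000 | xargs -n 1000 touch
cd ..
# the sync then fails with "Too many open files" before it completes
aws s3 sync ./local_file_dir/ s3://some-s3-file-bucket-test/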

I’m convinced this is a Lambda Cloud file system error: aws s3 sync is engineered to scale, and I have seen the same error in other code contexts multiple times.

Additionally, lsof seems unable to diagnose these open file handles. For example, the highest number of open file handles comes from the jupyter-lab process when running lsof -n | awk '{print $2}' | sort | uniq -c | sort -rn | head -n 10, but I don’t think that is the issue, because (1) I’ve killed it and stopped it from respawning, which still doesn’t solve the problem, and (2) jupyter-lab shows a high number of open file handles even on a freshly restarted instance.
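As a cross-check on lsof, counting the entries under /proc/<pid>/fd counts only real file descriptors, whereas lsof’s per-PID totals also include things like memory-mapped files and cwd/txt entries that don’t count against the ulimit. A rough sketch (run with sudo to see other users’ processes):

# count open file descriptors per process straight from /proc, top 10
for pid in /proc/[0-9]*; do
  n=$(ls "$pid/fd" 2>/dev/null | wc -l)
  [ "$n" -gt 0 ] && printf '%6d  %-8s %s\n' "$n" "${pid#/proc/}" "$(cat "$pid/comm" 2>/dev/null)"
done | sort -rn | head -n 10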

The only thing that has worked for me so far is restarting the instance, but that makes Lambda Labs untenable for normal-scale machine learning work.

@mpapili @yanos

I typically resolve this type of issue by either:

Setting PAM limits by adding the following line to /etc/pam.d/common-session:

session required pam_limits.so

Or just running ulimit -n unlimited.
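For completeness, a sketch of what that setup might look like (the limit value is only an example; pam_limits.so reads its per-user limits from /etc/security/limits.conf):

# /etc/security/limits.conf -- per-user limits picked up by pam_limits.so
*    soft    nofile    1048576
*    hard    nofile    1048576

# /etc/pam.d/common-session -- make sure the module is enabled
session required pam_limits.so

# or, for the current shell only (capped at the hard limit):
ulimit -n 1048576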