I’ve run into this same error. Linking my current thread on it:
Very easy to reproduce. Create a Lambda Cloud filesystem and attach it to an instance. Put 2M dummy files into it (you may hit the error from that step alone; a sketch for generating the files is below). Then run
aws s3 sync ./local_file_dir/ s3://some-s3-file-bucket-test/
You’ll hit the error before the sync finishes.
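For anyone reproducing this, here is a minimal sketch for generating the dummy files. The mount point and directory names are placeholders, not the actual paths from my setup; it's batched so no single command exceeds the argument-list limit.

# create ~2M empty dummy files on the attached filesystem (paths are placeholders)
mkdir -p /path/to/filesystem/local_file_dir
cd /path/to/filesystem/local_file_dir
for batch in $(seq 0 1999); do
    mkdir -p "batch_${batch}"
    # 2000 batches x 1000 files each = 2M files
    touch "batch_${batch}"/file_{0..999}.txt
done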
I’m convinced this is a Lambda Cloud filesystem error, since aws s3 sync is engineered to scale, and I’ve hit the same error in several other code contexts.
Additionally, lsof doesn’t seem able to pinpoint the offending file handles. For example, running
lsof -n | awk '{print $2}' | sort | uniq -c | sort -rn | head -n 10
shows the jupyter-lab process holding the largest number of open file handles, but I don’t think it’s the culprit, because (1) killing it and preventing it from respawning still doesn’t fix the error, and (2) jupyter-lab shows a high open-handle count even on a freshly restarted instance.
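A rough way to cross-check whether the descriptor limits are actually being hit: lsof tends to overcount because it also lists memory-mapped files and repeats entries per thread, so counting /proc/<pid>/fd entries is a more direct measure. Sketch below, using only generic Linux paths (nothing Lambda-specific assumed):

# per-process soft limit for the current shell
ulimit -n
# system-wide handles: allocated / unused / max
cat /proc/sys/fs/file-nr
# count real open descriptors per process, highest first
for pid in /proc/[0-9]*; do
    printf '%s %s\n' "$(ls "$pid/fd" 2>/dev/null | wc -l)" "$(cat "$pid/comm" 2>/dev/null)"
done | sort -rn | head -n 10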
So far the only fix that has worked for me is restarting the instance, which makes Lambda Labs untenable for me for normal-scale machine learning work.