"ChildFailedError" fine tuning Video-Llama and Video-ChatGPT

Hi Everyone,

We have been trying to fine-tune video-based visual language models like Video-ChatGPT and Video-LLaMA on our custom dataset. We are fine-tuning them on an A6000 machine and are getting the error shown below. Any insights into resolving this error would be helpful.

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 915992) of binary: /home/pavana/anaconda3/envs/video_chatgpt/bin/python
Traceback (most recent call last):
File "/home/pavana/anaconda3/envs/video_chatgpt/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/pavana/anaconda3/envs/video_chatgpt/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/pavana/anaconda3/envs/video_chatgpt/lib/python3.10/site-packages/torch/distributed/run.py", line 798, in <module>
main()
File "/home/pavana/anaconda3/envs/video_chatgpt/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/pavana/anaconda3/envs/video_chatgpt/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/pavana/anaconda3/envs/video_chatgpt/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/pavana/anaconda3/envs/video_chatgpt/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/pavana/anaconda3/envs/video_chatgpt/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/home/pavana/Video-ChatGPT/video_chatgpt/train/train_mem.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2023-10-25_09:51:53
host : cis-a6000
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 915992)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

The simple answer is that you are running distributed training, and the parent process is telling you that one of the child processes failed. The traceback does not say why, but possible causes include:

  • Insufficient resources for the child process (GPU, GPU memory, CPU, RAM); a quick check for GPU memory is sketched after this list.
  • If training is distributed across nodes, a remote host could have a different Python script, data, libraries, etc. (or a different Anaconda environment from the one the parent process runs in).
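For the first point, a quick way to rule out an out-of-memory failure on the A6000 is to print free/total memory for each visible GPU before launching. A minimal sketch using standard torch.cuda calls, run from the same conda environment you launch training from:

import torch

# Print free/total memory for every visible GPU so an out-of-memory
# child failure is easy to rule out before launching torchrun.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"cuda:{i} ({torch.cuda.get_device_name(i)}): "
          f"{free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")

If the free memory is already low, reduce the per-device batch size (or enable gradient checkpointing) before trying again.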

This error has come up in other threads with similar issues.

And of course, enable traceback on the child process/worker:
https://pytorch.org/docs/stable/elastic/errors.html
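The error-propagation page boils down to decorating the worker's entrypoint with record, so torchrun can capture the child's real exception and reprint it in the ChildFailedError summary instead of showing error_file: <N/A>. A minimal sketch, assuming train_mem.py exposes a main() function (the actual entrypoint name in Video-ChatGPT may differ):

from torch.distributed.elastic.multiprocessing.errors import record

@record
def main():
    ...  # existing training code in train_mem.py

if __name__ == "__main__":
    main()

With the decorator in place, relaunch with the same torchrun command and the root-cause traceback from rank 0 should appear in the "Root Cause" block.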