I am looking for best-of-breed job scheduling and prioritization solutions for our GPU cluster, which consists of 5 GPU nodes with 8 GPUs each (40 GPUs in total) and 2 CPU nodes. We have decided to run Kubernetes + Kubeflow, so we need something that is compatible with that stack and with our hardware.
My hope is that someone on this forum has seen something that meets our needs and doesn’t require a lot of work to set up and maintain.
Our needs are fairly simple:
- Long-running and short-running scheduling queues
- Preemption, so that when a long-running job reaches a checkpoint, short-running development jobs can temporarily get time on the GPUs, and the GPUs automatically go back to the long-running jobs when the short jobs finish.
- An interactive calendar or queue so people can reserve dedicated time.
- Compatibility with Kubernetes and Kubeflow
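To make the queue and preemption requirements more concrete, here is a rough sketch of how we imagine the two queues mapping onto Kubernetes `PriorityClass` objects. The class names and priority values are placeholders we made up for illustration. As far as we understand, stock Kubernetes preemption evicts lower-priority pods after their termination grace period rather than waiting for a checkpoint, so this only shows the intent, not a complete solution.

```python
import json


def priority_class(name, value, description,
                   preemption_policy="PreemptLowerPriority"):
    """Build a scheduling.k8s.io/v1 PriorityClass manifest as a plain dict."""
    return {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": name},
        "value": value,                      # higher value = scheduled first
        "globalDefault": False,
        "preemptionPolicy": preemption_policy,
        "description": description,
    }


# Hypothetical names and values, chosen only for this example.
long_running = priority_class(
    "long-running", 1000,
    "Long training jobs; may be preempted by short development jobs.",
    preemption_policy="Never",               # never evicts other pods itself
)
short_running = priority_class(
    "short-running", 10000,
    "Short development jobs; may preempt long-running jobs.",
)

# Wrap both objects in a v1 List so they form a single JSON document.
manifest = {
    "apiVersion": "v1",
    "kind": "List",
    "items": [long_running, short_running],
}

if __name__ == "__main__":
    print(json.dumps(manifest, indent=2))
```

A pod would then opt into a queue by setting `priorityClassName` in its spec. What we are missing is the checkpoint-aware part and the reservation calendar, which is why we are asking about tools that sit on top of this.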
We have looked at a number of solutions and would appreciate some advice.