Hey folks,
I've been experimenting with transformer models lately (LLaMA and Mistral), and I'm hitting a wall when it comes to inference latency. I'm aiming for something close to real-time performance, but even with a decent GPU (an A6000) and vLLM, first-token latency is still a pain point.
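For context, here's roughly the setup I'm benchmarking; the model name, prompt, and sampling parameters below are just placeholders for what I'm actually running:

```python
import time
from vllm import LLM, SamplingParams

# Rough sketch of my current setup -- model and parameters are placeholders.
llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # also testing LLaMA variants
    gpu_memory_utilization=0.90,
)
params = SamplingParams(temperature=0.7, max_tokens=256)

start = time.perf_counter()
outputs = llm.generate(["Summarize this article: ..."], params)
elapsed = time.perf_counter() - start

# This only captures end-to-end latency; for time-to-first-token I've been
# streaming through the OpenAI-compatible server and timing the first chunk.
print(f"end-to-end: {elapsed:.2f}s")
print(outputs[0].outputs[0].text)
```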
I've read a bit about model quantization and server-side batching, but I haven't found a silver bullet yet. I'm also wondering if anyone here has tried offloading parts of the pipeline to cloud-based environments. I get the basics of cloud computing, but I want to understand how it plays out in practice for running large models. Does it actually help with latency, or does it just shift the bottlenecks around?
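In case it helps frame the question, this is the direction I've been looking at on the quantization side (the checkpoint name is only an illustrative example of a pre-quantized AWQ model, not a recommendation):

```python
from vllm import LLM, SamplingParams

# Loading a pre-quantized AWQ checkpoint instead of full-precision weights.
# The repo name here is just an example of the kind of model I mean.
llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    quantization="awq",
)
params = SamplingParams(max_tokens=128)
print(llm.generate(["Hello, how are you?"], params)[0].outputs[0].text)
```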
If you've deployed anything like this in production or at scale, I'd love to hear what worked for you. Any tips, lessons learned, or pitfalls to avoid?
Thank you…