Hey folks,
I've been experimenting with transformer models lately (LLaMA and Mistral), and I'm hitting a wall when it comes to inference latency. I'm aiming for something close to real-time performance, but even with a decent GPU (an A6000) and vLLM, first-token latency is still a pain point.
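For context, here's roughly the setup I'm benchmarking; the model name, prompt, and sampling parameters below are just placeholders for what I'm actually running:

```python
import time
from vllm import LLM, SamplingParams

# Rough sketch of my current setup -- model and parameters are placeholders.
llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # also testing LLaMA variants
    gpu_memory_utilization=0.90,
)
params = SamplingParams(temperature=0.7, max_tokens=256)

start = time.perf_counter()
outputs = llm.generate(["Summarize this article: ..."], params)
elapsed = time.perf_counter() - start

# This only captures end-to-end latency; for time-to-first-token I've been
# streaming through the OpenAI-compatible server and timing the first chunk.
print(f"end-to-end: {elapsed:.2f}s")
print(outputs[0].outputs[0].text)
```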
I've read a bit about model quantization and server-side batching, but I haven't found a silver bullet yet. I'm also wondering if anyone here has tried offloading parts of the pipeline to cloud-based environments. I get the basics of cloud computing, but I want to understand how it plays out in practice for running large models. Does it actually help with latency, or does it just shift the bottlenecks around?
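In case it helps frame the question, this is the direction I've been looking at on the quantization side (the checkpoint name is only an illustrative example of a pre-quantized AWQ model, not a recommendation):

```python
from vllm import LLM, SamplingParams

# Loading a pre-quantized AWQ checkpoint instead of full-precision weights.
# The repo name here is just an example of the kind of model I mean.
llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    quantization="awq",
)
params = SamplingParams(max_tokens=128)
print(llm.generate(["Hello, how are you?"], params)[0].outputs[0].text)
```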
If you've deployed anything like this in production or at scale, I'd love to hear what worked for you. Any tips, lessons learned, or pitfalls to avoid?
Thank you…