Inference API Timeout

I am currently using Lambda Labs credits on the Inference API for a hackathon, and I am finding that llama3.3-70b-instruct-fp8, along with many other models, times out very quickly when called through the OpenAI API client. I cannot get past 30 iterations of a pipeline with 3k-token inputs and 20-token outputs, even with time.sleep(1) between calls and the OpenAI timeout set to 120s.

I initialize the client as follows:

import os

from openai import OpenAI

# Lambda's OpenAI-compatible endpoint
openai_api_key = os.getenv("LAMBDA_API_KEY")
openai_api_base = "https://api.lambda.ai/v1"

# Initialize the OpenAI client with a 120s timeout
openai_client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
    timeout=120,
)
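
In case the constructor-level timeout is somehow not being applied, the OpenAI Python client also supports per-request overrides via with_options, which should behave identically; a minimal sketch:

# Per-request timeout override; should be equivalent to the
# constructor-level timeout set above
response = openai_client.with_options(timeout=120).chat.completions.create(
    model="llama3.3-70b-instruct-fp8",
    messages=[{"role": "user", "content": "ping"}],
)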

Then I query the client for a summary:

def get_gpt_summary(article, dataset, model) -> str:
    # DATASET_SYSTEM_PROMPTS maps each dataset name to its system prompt
    history = [
        {"role": "system", "content": DATASET_SYSTEM_PROMPTS[dataset]},
        {
            "role": "user",
            "content": f"Article:\n{article}\n\nProvide only the summary with no other text.",
        },
    ]

    response = openai_client.chat.completions.create(
        model=model,
        messages=history,
    )
    return response.choices[0].message.content

Even with a sleep(1) call between the calls to get_gpt_summary, I still get a Timeout error from the OpenAI client after anywhere from 5 to 50 iterations. This is incredibly frustrating and has bottlenecked my research efforts. The behavior is the same regardless of model choice, and timeout=120 does not help. A successful inference call usually takes 6-12 seconds, yet the timeout fires long before 120s have elapsed, which suggests the problem lies elsewhere.
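
To make the failure mode clearer, the loop can be instrumented along these lines (a sketch; it assumes the failure surfaces as openai.APITimeoutError for a client-side timeout or as openai.APIStatusError carrying the HTTP status, and it reuses the article, dataset, and model variables from my pipeline):

import time

import openai

for i in range(100):
    start = time.monotonic()
    try:
        summary = get_gpt_summary(article, dataset, model)
    except openai.APITimeoutError:
        # Client-side timeout: the request exceeded the configured limit
        print(f"iteration {i}: client timeout after {time.monotonic() - start:.1f}s")
        break
    except openai.APIStatusError as e:
        # The server returned an error status (e.g. 524 from the gateway)
        print(f"iteration {i}: HTTP {e.status_code} after {time.monotonic() - start:.1f}s")
        break
    print(f"iteration {i}: ok in {time.monotonic() - start:.1f}s")
    time.sleep(1)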

Please let me know if you need more information.

The following modification of the example script from the Inference API docs reproduces the error:

from openai import OpenAI
import os
import dotenv
from tqdm import tqdm

dotenv.load_dotenv()

# Set API credentials and endpoint
openai_api_key = os.getenv("LAMBDA_API_KEY")
openai_api_base = "https://api.lambda.ai/v1"

# Initialize the OpenAI client
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Choose the model
model = "llama-4-scout-17b-16e-instruct"
for i in tqdm(range(100)):
    # Create a multi-turn chat completion request
    chat_completion = client.chat.completions.create(
        messages=[{
            "role": "system",
            "content": "You are an expert conversationalist who responds to the best of your ability."
        }, {
            "role": "user",
            "content": "Who won the world series in 2020?"
        }, {
            "role": "assistant",
            "content": "The Los Angeles Dodgers won the World Series in 2020."
        }, {
            "role": "user",
            "content": "Where was it played?"
        }],
        model=model,
    )

    # Print the full chat completion response
    print(chat_completion)

I am facing the exact same issue with the Inference API and would love to get some support on this. The raw error I get back is 524: a timeout occurred.
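
For what it's worth, 524 is Cloudflare's "a timeout occurred" status, meaning the gateway gave up waiting on the origin. One way to confirm the timeout is server-side rather than something in the OpenAI client is to hit the endpoint directly; a minimal sketch with requests, using the same endpoint and key as above:

import os

import requests

# Same OpenAI-compatible endpoint the client above talks to
resp = requests.post(
    "https://api.lambda.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.getenv('LAMBDA_API_KEY')}"},
    json={
        "model": "llama-4-scout-17b-16e-instruct",
        "messages": [{"role": "user", "content": "Where was the 2020 World Series played?"}],
    },
    timeout=120,
)
print(resp.status_code)  # a 524 here would confirm the timeout is upstream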