Measurement Method

[Diagram: the Client (on the local network) sends a request with the prompt to the Endpoint (localhost). Prompt Processing Time runs from sending the request until the first token; Token Generation Time runs from the first token until the last.]
  • Client — Custom OpenAI-compatible client created for benchmarking, running on the same local network.
  • Endpoint — OpenAI-compatible endpoint as provided by the inference app.

Every benchmark measures the time between sending a prompt and receiving the complete response. Streaming is used to capture the arrival of the first token, which divides the total time into two phases:

  • Prompt Processing (PP) — Time from sending the request until the first token arrives.
  • Token Generation (TG) — Time from the first token until the last.

Both phases combined yield the Total Time. Tokens per second are calculated per phase by dividing the respective token count by that phase's duration: prompt tokens over the PP time, generated tokens over the TG time.
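The two-phase split can be sketched with a small helper that consumes a streamed response and timestamps the first and last tokens. This is a minimal illustration, not the benchmark's actual code: `measure_stream` and `fake_stream` are hypothetical names, and any iterator of streamed tokens (for example, chunks from an OpenAI-compatible client called with `stream=True`) could be passed in.

```python
import time
from typing import Iterable

def measure_stream(tokens: Iterable[str]) -> dict:
    """Split total response time into Prompt Processing (PP) and Token Generation (TG).

    PP = time from sending the request until the first token arrives.
    TG = time from the first token until the last token.
    """
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in tokens:
        now = time.perf_counter()
        if first_token_at is None:
            first_token_at = now  # arrival of the first token ends the PP phase
        count += 1
    end = time.perf_counter()
    pp = (first_token_at if first_token_at is not None else end) - start
    tg = end - (first_token_at if first_token_at is not None else end)
    # Tokens per second for TG: generated tokens after the first, divided by TG duration.
    tg_tps = (count - 1) / tg if count > 1 and tg > 0 else float("nan")
    return {"pp_s": pp, "tg_s": tg, "total_s": pp + tg, "tg_tok_per_s": tg_tps}
```

Note that the PP tokens-per-second figure would additionally require the prompt's token count, which the stream itself does not carry; a real client would take it from the endpoint's usage statistics or a tokenizer.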

A Note on Timing

Technically, the first generated token belongs to the generation phase, not to prompt processing. Including it in the "Prompt Processing Time" measurement therefore introduces a small bias. However, this approach lets all endpoints be tested uniformly with the same streaming-based method, without relying on server-side benchmarking tools or logs.

It also mirrors the user experience: prompt processing feels like the wait until you see the first character appear.