Measurement Method
- Client — Custom OpenAI-compatible client created for benchmarking, running on the same local network.
- Endpoint — OpenAI-compatible endpoint as provided by the inference app.
Every benchmark measures the time between sending a prompt and receiving the complete response. Streaming is used to capture the arrival of the first token, which divides the total time into two phases:
- Prompt Processing (PP) — Time from sending the request until the first token arrives.
- Token Generation (TG) — Time from the first token until the last.
Both phases combined yield the Total Time. Tokens per second for each phase are obtained by dividing the respective token count by that phase's duration, as in the sketch below.
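The actual client is custom, but as an illustration here is a minimal sketch of the same measurement done with the openai Python SDK against an OpenAI-compatible endpoint. The base URL, model name, and the `stream_options` usage report are assumptions; not every compatible server implements the latter.

```python
import time
from openai import OpenAI

# Hypothetical address and model name; point these at the inference app under test.
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="unused")

def benchmark(prompt: str, model: str) -> dict:
    """Time one streamed completion, split into PP and TG phases."""
    usage = None
    first_token_at = None
    pieces = []

    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        # Supported by many OpenAI-compatible servers: a final chunk then
        # carries the prompt/completion token counts.
        stream_options={"include_usage": True},
    )
    for chunk in stream:
        if chunk.usage is not None:
            usage = chunk.usage
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()  # first token: PP ends here
            pieces.append(chunk.choices[0].delta.content)
    end = time.perf_counter()

    assert first_token_at is not None, "no tokens received"
    pp_time = first_token_at - start  # Prompt Processing: request sent -> first token
    tg_time = end - first_token_at    # Token Generation: first token -> last token
    return {
        "total_s": end - start,
        "pp_s": pp_time,
        "tg_s": tg_time,
        # Tokens/s per phase: phase token count divided by phase duration.
        "pp_tok_s": usage.prompt_tokens / pp_time if usage else None,
        "tg_tok_s": usage.completion_tokens / tg_time if usage else None,
        "text": "".join(pieces),
    }
```

`time.perf_counter()` is monotonic and therefore immune to wall-clock adjustments. If a server does not report usage in the stream, the token counts have to come from a local tokenizer instead.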
A Note on Timing
Technically, the first generated token is part of the generation phase, not prompt processing. Including it in the measured prompt processing time therefore introduces a small bias: PP is overstated, and TG understated, by one token's generation interval. However, this approach lets us test all endpoints uniformly using the same streaming-based method, without relying on separate benchmarking tools or log output.
It also mirrors the user experience: prompt processing feels like the wait until you see the first character appear.
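Stated as a relation (my notation, not part of the original method), assuming an approximately constant generation rate $r_{\mathrm{TG}}$:

$$
t_{\mathrm{PP}}^{\mathrm{meas}} = t_{\mathrm{PP}} + \frac{1}{r_{\mathrm{TG}}},
\qquad
t_{\mathrm{TG}}^{\mathrm{meas}} = t_{\mathrm{TG}} - \frac{1}{r_{\mathrm{TG}}}
$$

At typical local generation rates of tens of tokens per second, the shift amounts to a few tens of milliseconds.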
A Note on Desktop Systems
On systems with an active GUI, I switch to a virtual console and terminate the graphical session to free up VRAM before running benchmarks.
On Linux Mint, that's `Ctrl+Alt+F2` followed by `sudo systemctl stop lightdm`.
Otherwise, some of the tested model and context-size combinations would not fit into the available VRAM.