Measurement Method
- Client — Custom OpenAI-compatible client created for benchmarking, running on the same local network.
- Endpoint — OpenAI-compatible endpoint as provided by the inference app.
Every benchmark measures the time between sending a prompt and receiving the complete response. Streaming is used to capture the arrival of the first token, which divides the total time into two phases:
- Prompt Processing (PP) — Time from sending the request until the first token arrives.
- Token Generation (TG) — Time from the first token until the last.
Both phases combined yield the Total Time. Tokens per second for each phase are obtained by dividing the respective token count by that phase's duration, as in the sketch below.
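The actual client is custom, but as an illustration here is a minimal sketch of the same measurement done with the openai Python SDK against an OpenAI-compatible endpoint. The base URL, model name, and the `stream_options` usage report are assumptions; not every compatible server implements the latter.

```python
import time
from openai import OpenAI

# Hypothetical address and model name; point these at the inference app under test.
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="unused")

def benchmark(prompt: str, model: str) -> dict:
    """Time one streamed completion, split into PP and TG phases."""
    usage = None
    first_token_at = None
    pieces = []

    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        # Supported by many OpenAI-compatible servers: a final chunk then
        # carries the prompt/completion token counts.
        stream_options={"include_usage": True},
    )
    for chunk in stream:
        if chunk.usage is not None:
            usage = chunk.usage
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()  # first token: PP ends here
            pieces.append(chunk.choices[0].delta.content)
    end = time.perf_counter()

    assert first_token_at is not None, "no tokens received"
    pp_time = first_token_at - start  # Prompt Processing: request sent -> first token
    tg_time = end - first_token_at    # Token Generation: first token -> last token
    return {
        "total_s": end - start,
        "pp_s": pp_time,
        "tg_s": tg_time,
        # Tokens/s per phase: phase token count divided by phase duration.
        "pp_tok_s": usage.prompt_tokens / pp_time if usage else None,
        "tg_tok_s": usage.completion_tokens / tg_time if usage else None,
        "text": "".join(pieces),
    }
```

`time.perf_counter()` is monotonic and therefore immune to wall-clock adjustments. If a server does not report usage in the stream, the token counts have to come from a local tokenizer instead.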
A Note on Timing
Technically, the first generated token is part of the generation phase, not prompt processing. Including it in the measured prompt processing time therefore introduces a small bias: PP is overstated, and TG understated, by one token's generation interval. However, this approach lets us test all endpoints uniformly using the same streaming-based method, without relying on separate benchmarking tools or log output.
It also mirrors the user experience: prompt processing feels like the wait until you see the first character appear.
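Stated as a relation (my notation, not part of the original method), assuming an approximately constant generation rate $r_{\mathrm{TG}}$:

$$
t_{\mathrm{PP}}^{\mathrm{meas}} = t_{\mathrm{PP}} + \frac{1}{r_{\mathrm{TG}}},
\qquad
t_{\mathrm{TG}}^{\mathrm{meas}} = t_{\mathrm{TG}} - \frac{1}{r_{\mathrm{TG}}}
$$

At typical local generation rates of tens of tokens per second, the shift amounts to a few tens of milliseconds.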
A Note on Desktop Systems
On systems with an active GUI, I switch to a virtual console and terminate the graphical session to free up VRAM before running benchmarks.
On Linux Mint, that's `Ctrl+Alt+F2` followed by `sudo systemctl stop lightdm`.
Otherwise, some of the tested model and context-size combinations would not fit into the available VRAM.