Run Your Own
Depending on your hardware, you can run your own local inference with the help of the published launch commands.
Prerequisites
- Hugging Face CLI — for downloading models
- Docker — for running the inference server
- NVIDIA Container Toolkit — for GPU access inside Docker (NVIDIA GPUs only)
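To quickly confirm everything is in place, you can run a few checks (the CUDA image tag below is only an example for verifying GPU access inside Docker; any recent nvidia/cuda tag will do):
hf --help
docker --version
docker run --rm --gpus all nvidia/cuda:12.9.1-base-ubuntu24.04 nvidia-smi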
Download the Model
Go to the LLMs page and find the model you want to run. Each quantization shows a download command.
hf download unsloth/Qwen3-4B-Instruct-2507-GGUF --include Qwen3-4B-Instruct-2507-UD-Q4_K_XL.gguf
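The file lands in your Hugging Face cache under a snapshot hash. The launch command below mounts exactly this cache directory and references the snapshot in its --model path, so it can help to look up where the download ended up on your machine (the hash may differ from the one in the published command):
ls ~/.cache/huggingface/hub/models--unsloth--Qwen3-4B-Instruct-2507-GGUF/snapshots/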
Get the Launch Command
Find a benchmark that matches your hardware and model of interest. Click through to the details page — it shows the exact launch command used.
Adjust as needed for your setup. There is a good chance you won't need to change anything as long as the basics match, e.g. a single GPU from the same manufacturer and the same model.
sudo docker run --rm --ipc=host \
--gpus all \
-v ~/.cache/huggingface/hub/models--unsloth--Qwen3-4B-Instruct-2507-GGUF:/model:ro \
-v ~/.cache/llocalhost/llama-cpp/llama-cpp_runner-1.0.4-cuda-12.9.1-devel-ubuntu24.04:/script-work-dir \
-e BUILDER_WORK_DIR="/script-work-dir" \
-e BUILDER_GIT_REF=b7833 \
-e BUILDER_CMD='cmake -B build -DGGML_CUDA=ON' \
-p 8080:8080 \
llocalhost/llama-cpp:runner-1.0.4-cuda-12.9.1-devel-ubuntu24.04 \
./llama-server \
--model "/model/snapshots/a06e946bb6b655725eafa393f4a9745d460374c9/Qwen3-4B-Instruct-2507-UD-Q4_K_XL.gguf" \
--alias "qwen3-4b-instruct-2507_UD-Q4-K-XL" \
--n-gpu-layers 999 \
--flash-attn on \
--batch-size 2048 \
--ubatch-size 1024 \
--no-mmap \
--ctx-size 17300 \
--temp 0.7 --top-p 0.8 --min-p 0 --top-k 20 \
--jinja \
--host 0.0.0.0 --port 8080
Run Inference
Run the launch command from the previous step to start the inference server. Once it is running, you can connect any OpenAI-compatible client to http://<host>:8080.
With llama.cpp, you will even find a fully functional chat interface at that URL.
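As a quick sanity check, a minimal request against the OpenAI-compatible chat completions endpoint could look like this (the model name is the --alias from the launch command above; adjust host and port to your setup):
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3-4b-instruct-2507_UD-Q4-K-XL",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'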
Optional: Reproduce Speed Benchmark
Use the Test Prompt Generator to create prompts of a specific token length. If you want to reproduce a speed benchmark that generates 500 tokens, set the generation limit in your chat UI to 500.
Note that you will most likely not hit the exact target prompt length, because different models use different tokenizers.
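If you call the server directly instead of using a chat UI, you can cap generation with the standard max_tokens field on the same endpoint, for example (same alias as above, with the generated test prompt pasted in as the user message):
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3-4b-instruct-2507_UD-Q4-K-XL",
        "messages": [{"role": "user", "content": "<paste the generated test prompt here>"}],
        "max_tokens": 500
      }'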
Tips
- Keep the model going: You may need to change the instruction given in the test prompt if the model stops before reaching 500 tokens. Telling it to translate the test prompt to a different language is one option.