Run Your Own
Depending on your hardware, you can run your own local inference with the help of the published launch commands.
Prerequisites
- Hugging Face CLI — for downloading models
- Docker — for running the inference server
- NVIDIA Container Toolkit — for GPU access inside Docker (NVIDIA GPUs only)
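To quickly confirm everything is in place, you can run a few checks (the CUDA image tag below is only an example for verifying GPU access inside Docker; any recent nvidia/cuda tag will do):
hf --help
docker --version
docker run --rm --gpus all nvidia/cuda:12.9.1-base-ubuntu24.04 nvidia-smi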
Download the Model
Go to the LLMs page and find the model you want to run. Each quantization shows a download command.
hf download unsloth/Qwen3-4B-Instruct-2507-GGUF --include Qwen3-4B-Instruct-2507-UD-Q4_K_XL.gguf
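The file lands in your Hugging Face cache under a snapshot hash. The launch command below mounts exactly this cache directory and references the snapshot in its --model path, so it can help to look up where the download ended up on your machine (the hash may differ from the one in the published command):
ls ~/.cache/huggingface/hub/models--unsloth--Qwen3-4B-Instruct-2507-GGUF/snapshots/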
Get the Launch Command
Find a benchmark that matches your hardware and model of interest. Click through to the details page — it shows the exact launch command used.
Adjust as needed for your setup. There is a good chance you won't need to change anything as long as the basics match, e.g. a single GPU from the same manufacturer and the same model.
sudo docker run --rm --ipc=host \
--gpus all \
-v ~/.cache/huggingface/hub/models--unsloth--Qwen3-4B-Instruct-2507-GGUF:/model:ro \
-v ~/.cache/llocalhost/llama-cpp/llama-cpp_runner-1.0.4-cuda-12.9.1-devel-ubuntu24.04:/script-work-dir \
-e BUILDER_WORK_DIR="/script-work-dir" \
-e BUILDER_GIT_REF=b7833 \
-e BUILDER_CMD='cmake -B build -DGGML_CUDA=ON' \
-p 8080:8080 \
llocalhost/llama-cpp:runner-1.0.4-cuda-12.9.1-devel-ubuntu24.04 \
./llama-server \
--model "/model/snapshots/a06e946bb6b655725eafa393f4a9745d460374c9/Qwen3-4B-Instruct-2507-UD-Q4_K_XL.gguf" \
--alias "qwen3-4b-instruct-2507_UD-Q4-K-XL" \
--n-gpu-layers 999 \
--flash-attn on \
--batch-size 2048 \
--ubatch-size 1024 \
--no-mmap \
--ctx-size 17300 \
--temp 0.7 --top-p 0.8 --min-p 0 --top-k 20 \
--jinja \
--host 0.0.0.0 --port 8080
Run Inference
Run the launch command from the previous step to start the inference server. Once it is running, you can connect any OpenAI-compatible client to http://<host>:8080.
With llama.cpp, you will even find a fully functional chat interface at that URL.
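As a quick sanity check, a minimal request against the OpenAI-compatible chat completions endpoint could look like this (the model name is the --alias from the launch command above; adjust host and port to your setup):
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3-4b-instruct-2507_UD-Q4-K-XL",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'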
Optional: Reproduce Speed Benchmark
Use the Test Prompt Generator to create prompts of a specific token length. If you want to reproduce a speed benchmark that generates 500 tokens, set the generation limit in your chat UI to 500.
Note that you will most likely not hit the exact target prompt length, because different models use different tokenizers.
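If you call the server directly instead of using a chat UI, you can cap generation with the standard max_tokens field on the same endpoint, for example (same alias as above, with the generated test prompt pasted in as the user message):
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3-4b-instruct-2507_UD-Q4-K-XL",
        "messages": [{"role": "user", "content": "<paste the generated test prompt here>"}],
        "max_tokens": 500
      }'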
Tips
- Keep the model going: You may need to change the instruction given in the test prompt if the model stops before reaching 500 tokens. Telling it to translate the test prompt to a different language is one option.