Synced from docling-serve v1.21.0
This page summarizes the docling-serve documentation at v1.21.0. For the exhaustive reference, follow the links to the source repository.
Deployment
Get docling-serve running on one machine fast. For cluster/production hardening, follow the links to the docling-serve repo.
Two independent choices shape how you run it:
- How you start the process — the
docling-servecommand, or Docker Compose. - Which compute engine runs the jobs (
DOCLING_SERVE_ENG_KIND) — the in-process Local engine (default), or the Redis-backed RQ engine.
| Local engine (default) | RQ engine (Redis + workers) | |
|---|---|---|
docling-serve command |
Quickstart / dev | Distributed |
| Docker Compose | Containerized single node (+GPU) | → serve repo |
Configuration
docling-serve is configured by CLI flags or environment variables. Precedence is environment variable > config file > defaults.
Subprocess gotcha
When uvicorn runs with --reload or --workers > 1 it spawns subprocesses, and CLI flags (e.g. --enable-ui, --artifacts-path) are ignored. Use the DOCLING_SERVE_* environment variables in those deployments.
Most common settings
| Setting (env var) | What it does | Default |
|---|---|---|
UVICORN_HOST / UVICORN_PORT |
bind address / port | 0.0.0.0 / 5001 |
UVICORN_WORKERS |
uvicorn worker processes | 1 |
DOCLING_SERVE_API_KEY |
require an X-Api-Key header |
unset |
DOCLING_SERVE_ENABLE_UI |
serve the Gradio demo UI at /ui |
false |
DOCLING_SERVE_ARTIFACTS_PATH |
local path to pre-downloaded models | unset (auto-download) |
DOCLING_SERVE_MAX_NUM_PAGES / DOCLING_SERVE_MAX_FILE_SIZE |
per-request limits | unset |
DOCLING_SERVE_ENG_KIND |
async engine: local or rq (also kfp/ray — see serve repo) |
local |
See the full reference in the source repo: configuration.md and .env.example.
Docling settings (env vars)
These tune Docling itself and are read by the server too:
| Env var | What it does | Default |
|---|---|---|
DOCLING_DEVICE |
inference device: cpu / cuda / mps |
auto |
DOCLING_NUM_THREADS |
CPU threads | runtime default |
DOCLING_PERF_PAGE_BATCH_SIZE |
pages per batch | runtime default |
DOCLING_PERF_ELEMENTS_BATCH_SIZE |
elements per batch | runtime default |
DOCLING_DEBUG_PROFILE_PIPELINE_TIMINGS |
log per-stage timings | false |
For how to choose device/perf values see GPU support. For offline / air-gapped model setup see the FAQ and Advanced options; set DOCLING_SERVE_ARTIFACTS_PATH to a pre-populated model directory.
Compute engines
docling-serve runs each conversion as an asynchronous job dispatched to a compute engine, chosen with DOCLING_SERVE_ENG_KIND:
- Local (
local, the default) — jobs run in an in-process thread pool inside the server. No external services; everything stays on one host. Tunable withDOCLING_SERVE_ENG_LOC_NUM_WORKERS(default2) andDOCLING_SERVE_ENG_LOC_SHARE_MODELS(defaultfalse). Best for a single machine. - RQ (
rq) — jobs are queued in Redis and executed by separatedocling-serve rq-workerprocesses, so the API tier and the conversion workers scale independently. Best for horizontal scaling and higher throughput. - KFP / Ray — Kubeflow Pipelines and Ray engines for cluster orchestration; see the serve repo.
Running it
Simple command (Local engine — the default quickstart)
pip install "docling-serve[ui]"
docling-serve run --enable-ui # production-style: reload off, binds 0.0.0.0, UI off by default
# docling-serve dev # dev: auto-reload, binds 127.0.0.1, UI on (localhost only)
API at http://localhost:5001, interactive docs at /docs, demo UI at /ui. Smoke test:
curl -X POST "http://localhost:5001/v1/convert/source/async" \
-H "Content-Type: application/json" \
-d '{"http_sources": [{"url": "https://arxiv.org/pdf/2501.17887"}]}'
Note
The demo UI (--enable-ui / DOCLING_SERVE_ENABLE_UI) is a Gradio app; files it produces are cleared from its cache after ~10 hours. It is a demonstrator, not durable storage.
Docker Compose (incl. local GPU)
Same server, containerized. The shipped compose examples are all-in-one containers that don't set ENG_KIND, so they run the default Local engine.
# Pure CPU (no compose)
podman run -p 5001:5001 -e DOCLING_SERVE_ENABLE_UI=1 quay.io/docling-project/docling-serve
# NVIDIA GPU
docker compose -f compose-nvidia.yaml up -d
# AMD GPU
docker compose -f compose-amd.yaml up -d
Compose manifests: compose-nvidia.yaml, compose-amd.yaml.
GPU prerequisites (host side; for the Python AcceleratorOptions view see GPU support and RTX GPU):
- NVIDIA — driver ≥ 550.54.14 +
nvidia-container-toolkit+ the nvidia container runtime. - AMD — AMDGPU/ROCm ≥ 6.3; the ROCm image is not published, build it with
make docling-serve-rocm-image. Detailed GID wiring: serve repo.
Note
The compose files pin older image tags (-cu126:main, -rocm72:main) than the README image table; treat the README image table as the source of truth and adjust the image: line if needed. There is no shipped single-CPU compose file — use the podman one-liner for pure CPU.
RQ engine (distributed: Redis + separate workers)
The API enqueues jobs to Redis; conversion runs in separate docling-serve rq-worker processes.
# 1) Redis
docker run -p 6379:6379 redis:7-alpine
# 2) API server (enqueues jobs)
DOCLING_SERVE_ENG_KIND=rq \
DOCLING_SERVE_ENG_RQ_REDIS_URL=redis://localhost:6379/0 \
docling-serve run
# 3) one or more workers (do the conversion)
DOCLING_SERVE_ENG_KIND=rq \
DOCLING_SERVE_ENG_RQ_REDIS_URL=redis://localhost:6379/0 \
docling-serve rq-worker
Warning
The API alone accepts jobs but nothing runs them without at least one rq-worker. DOCLING_SERVE_ENG_RQ_REDIS_URL is required (no default) and must be identical across every API and worker process.
Cluster, production & advanced variants
These live in the docling-serve repo (run-time manifests aren't vendored here):
- OpenShift — simple deployment
- Multi-worker RQ on Kubernetes (Redis + worker pods + secret)
- Secure deployment with oauth-proxy / TLS / Route
- ReplicaSets with sticky sessions (task state is node-local)
- Model-cache PVC/Job (pre-baking weights)
- KFP / Ray engines, OpenTelemetry, CUDA image-tagging policy → serve repo
Prefer not to run any of this yourself? See the managed service.