Synced from docling-serve v1.21.0

This page summarizes the docling-serve documentation at v1.21.0. For the exhaustive reference, follow the links to the source repository.

Deployment

Get docling-serve running on one machine fast. For cluster/production hardening, follow the links to the docling-serve repo.

Two independent choices shape how you run it:

How you start the process — the docling-serve command, or Docker Compose.
Which compute engine runs the jobs (DOCLING_SERVE_ENG_KIND) — the in-process Local engine (default), or the Redis-backed RQ engine.

	Local engine (default)	RQ engine (Redis + workers)
`docling-serve` command	Quickstart / dev	Distributed
Docker Compose	Containerized single node (+GPU)	→ serve repo

Configuration

docling-serve is configured by CLI flags or environment variables. Precedence is environment variable > config file > defaults.

Subprocess gotcha

When uvicorn runs with --reload or with more than one worker (--workers > 1), it spawns subprocesses, and CLI flags such as --enable-ui and --artifacts-path are ignored in them. Set the equivalent DOCLING_SERVE_* environment variables in those deployments instead.
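
As a minimal sketch of that advice, the flags can be expressed as environment variables before starting a multi-worker server. The artifacts path is a hypothetical placeholder, and the value `1` for the UI toggle mirrors the container example later on this page.

```shell
# Environment-variable equivalents of CLI flags, so that uvicorn
# subprocesses see the same configuration as the parent process.
export UVICORN_WORKERS=2                  # more than one worker process
export DOCLING_SERVE_ENABLE_UI=1          # instead of --enable-ui
# Hypothetical path: point this at your own pre-downloaded model directory.
export DOCLING_SERVE_ARTIFACTS_PATH="$HOME/docling-models"   # instead of --artifacts-path

# Then start the server without the corresponding flags:
# docling-serve run
```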

Most common settings

Setting (env var)	What it does	Default
`UVICORN_HOST` / `UVICORN_PORT`	bind address / port	`0.0.0.0` / `5001`
`UVICORN_WORKERS`	uvicorn worker processes	`1`
`DOCLING_SERVE_API_KEY`	require an `X-Api-Key` header	unset
`DOCLING_SERVE_ENABLE_UI`	serve the Gradio demo UI at `/ui`	`false`
`DOCLING_SERVE_ARTIFACTS_PATH`	local path to pre-downloaded models	unset (auto-download)
`DOCLING_SERVE_MAX_NUM_PAGES` / `DOCLING_SERVE_MAX_FILE_SIZE`	per-request limits	unset
`DOCLING_SERVE_ENG_KIND`	async engine: `local` or `rq` (also `kfp`/`ray` — see serve repo)	`local`

See the full reference in the source repo: configuration.md and .env.example.
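
When DOCLING_SERVE_API_KEY is set on the server, clients have to send the key in an X-Api-Key header. The wrapper below is one possible way to do that from the shell; `docling_curl` is a hypothetical helper name, and reusing the same variable name on the client side is only a convenience assumed here.

```shell
# Call curl, adding the X-Api-Key header only when a key is configured
# in the client's environment. All arguments are passed through to curl.
docling_curl() {
  if [ -n "${DOCLING_SERVE_API_KEY:-}" ]; then
    curl -sS -H "X-Api-Key: ${DOCLING_SERVE_API_KEY}" "$@"
  else
    curl -sS "$@"
  fi
}

# Usage (same request as any plain curl call, plus the header if a key is set):
#   DOCLING_SERVE_API_KEY=my-secret docling_curl http://localhost:5001/docs
```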

Docling settings (env vars)

These tune Docling itself and are read by the server too:

Env var	What it does	Default
`DOCLING_DEVICE`	inference device: `cpu` / `cuda` / `mps`	auto
`DOCLING_NUM_THREADS`	CPU threads	runtime default
`DOCLING_PERF_PAGE_BATCH_SIZE`	pages per batch	runtime default
`DOCLING_PERF_ELEMENTS_BATCH_SIZE`	elements per batch	runtime default
`DOCLING_DEBUG_PROFILE_PIPELINE_TIMINGS`	log per-stage timings	`false`

For how to choose device/perf values see GPU support. For offline / air-gapped model setup see the FAQ and Advanced options; set DOCLING_SERVE_ARTIFACTS_PATH to a pre-populated model directory.

Compute engines

docling-serve runs each conversion as an asynchronous job dispatched to a compute engine, chosen with DOCLING_SERVE_ENG_KIND:

Local (local, the default) — jobs run in an in-process thread pool inside the server. No external services; everything stays on one host. Tunable with DOCLING_SERVE_ENG_LOC_NUM_WORKERS (default 2) and DOCLING_SERVE_ENG_LOC_SHARE_MODELS (default false). Best for a single machine.
RQ (rq) — jobs are queued in Redis and executed by separate docling-serve rq-worker processes, so the API tier and the conversion workers scale independently. Best for horizontal scaling and higher throughput.
KFP / Ray — Kubeflow Pipelines and Ray engines for cluster orchestration; see the serve repo.
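
The Local engine tunables can be scoped to a single command rather than exported globally. This is a sketch only: `with_local_tuning` is a hypothetical helper, and the values 4 and true are illustrative, not recommendations.

```shell
# Run any command with Local-engine settings applied to that command only.
with_local_tuning() {
  DOCLING_SERVE_ENG_KIND=local \
  DOCLING_SERVE_ENG_LOC_NUM_WORKERS=4 \
  DOCLING_SERVE_ENG_LOC_SHARE_MODELS=true \
  "$@"
}

# Usage:
#   with_local_tuning docling-serve run
```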

Running it

Simple command (Local engine — the default quickstart)

pip install "docling-serve[ui]"
docling-serve run --enable-ui      # production-style: reload off, binds 0.0.0.0, UI off by default
# docling-serve dev                # dev: auto-reload, binds 127.0.0.1, UI on (localhost only)

API at http://localhost:5001, interactive docs at /docs, demo UI at /ui. Smoke test:

curl -X POST "http://localhost:5001/v1/convert/source/async" \
  -H "Content-Type: application/json" \
  -d '{"http_sources": [{"url": "https://arxiv.org/pdf/2501.17887"}]}'
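
Because the smoke test hits the async endpoint, the response identifies a queued job rather than carrying the converted document. The helpers below sketch a submit-then-poll flow under two assumptions that this page does not confirm: that the response contains a "task_id" string field, and that status is served at /v1/status/poll/<task_id>. Check both against /docs on your server. The function names and the DOCLING_URL variable are hypothetical, and no API key header is sent.

```shell
# Base URL of the server; DOCLING_URL is a hypothetical override variable.
DOCLING_URL="${DOCLING_URL:-http://localhost:5001}"

# Build the smoke-test request body for one source URL.
# The URL is inserted verbatim, so it must not contain double quotes.
build_payload() {
  printf '{"http_sources": [{"url": "%s"}]}\n' "$1"
}

# Read a JSON response on stdin and print the value of "task_id".
# ASSUMPTION: the response carries a "task_id" string field.
extract_task_id() {
  sed -n 's/.*"task_id"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Submit a source URL and print the task id of the queued job.
submit_async() {
  build_payload "$1" |
    curl -sS -X POST "${DOCLING_URL}/v1/convert/source/async" \
      -H "Content-Type: application/json" -d @- |
    extract_task_id
}

# Fetch the current status document for a task id.
# ASSUMPTION: the status endpoint path is /v1/status/poll/<task_id>.
poll_status() {
  curl -sS "${DOCLING_URL}/v1/status/poll/$1"
}

# Usage:
#   task_id="$(submit_async https://arxiv.org/pdf/2501.17887)"
#   poll_status "$task_id"
```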

Note

The demo UI (--enable-ui / DOCLING_SERVE_ENABLE_UI) is a Gradio app; files it produces are cleared from its cache after ~10 hours. It is a demonstrator, not durable storage.

Docker Compose (incl. local GPU)

Same server, containerized. The shipped compose examples are all-in-one containers that don't set DOCLING_SERVE_ENG_KIND, so they run the default Local engine.

# Pure CPU (no compose)
podman run -p 5001:5001 -e DOCLING_SERVE_ENABLE_UI=1 quay.io/docling-project/docling-serve

# NVIDIA GPU
docker compose -f compose-nvidia.yaml up -d

# AMD GPU
docker compose -f compose-amd.yaml up -d

Compose manifests: compose-nvidia.yaml, compose-amd.yaml.

GPU prerequisites (host side; for the Python AcceleratorOptions view see GPU support and RTX GPU):

NVIDIA: driver ≥ 550.54.14, nvidia-container-toolkit, and the nvidia container runtime.
AMD: AMDGPU/ROCm ≥ 6.3. The ROCm image is not published; build it with make docling-serve-rocm-image. Detailed GID wiring: serve repo.
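
The NVIDIA driver minimum can be checked on the host before starting the GPU compose file. The sketch below compares dotted version strings numerically; `version_ge` and `check_nvidia_driver` are hypothetical helpers, and the script assumes `nvidia-smi` is on PATH on a GPU host.

```shell
MIN_NVIDIA_DRIVER="550.54.14"   # minimum driver version listed in the prerequisites

# Succeed when dotted version $1 is greater than or equal to $2.
# Components are compared numerically, left to right; missing ones count as 0.
version_ge() (
  awk -v a="$1" -v b="$2" 'BEGIN {
    na = split(a, x, "."); nb = split(b, y, ".")
    n = (na > nb) ? na : nb
    for (i = 1; i <= n; i++) {
      xi = (i <= na) ? x[i] + 0 : 0
      yi = (i <= nb) ? y[i] + 0 : 0
      if (xi > yi) exit 0
      if (xi < yi) exit 1
    }
    exit 0
  }'
)

# Report whether the installed driver meets the minimum.
# Exit codes: 0 = new enough, 1 = too old, 2 = nvidia-smi not available.
check_nvidia_driver() (
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found; is the NVIDIA driver installed?" >&2
    exit 2
  fi
  current="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
  if version_ge "$current" "$MIN_NVIDIA_DRIVER"; then
    echo "driver $current meets the minimum $MIN_NVIDIA_DRIVER"
  else
    echo "driver $current is older than $MIN_NVIDIA_DRIVER" >&2
    exit 1
  fi
)

# Usage:
#   check_nvidia_driver && docker compose -f compose-nvidia.yaml up -d
```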

Note

The compose files pin older image tags (-cu126:main, -rocm72:main) than the README image table; treat the README image table as the source of truth and adjust the image: line if needed. There is no shipped single-CPU compose file — use the podman one-liner for pure CPU.

RQ engine (distributed: Redis + separate workers)

The API enqueues jobs to Redis; conversion runs in separate docling-serve rq-worker processes.

# 1) Redis
docker run -p 6379:6379 redis:7-alpine

# 2) API server (enqueues jobs)
DOCLING_SERVE_ENG_KIND=rq \
DOCLING_SERVE_ENG_RQ_REDIS_URL=redis://localhost:6379/0 \
docling-serve run

# 3) one or more workers (do the conversion)
DOCLING_SERVE_ENG_KIND=rq \
DOCLING_SERVE_ENG_RQ_REDIS_URL=redis://localhost:6379/0 \
docling-serve rq-worker

Warning

The API alone accepts jobs but nothing runs them without at least one rq-worker. DOCLING_SERVE_ENG_RQ_REDIS_URL is required (no default) and must be identical across every API and worker process.
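
A small preflight check can catch the missing-URL case before an API or worker process starts. This is a sketch under the rules stated above; `check_engine_env` is a hypothetical helper and it only covers the local and rq engines. It cannot verify that a worker is actually running or that every process uses the same URL.

```shell
# Validate engine-related environment variables before launching a process.
# Runs in a subshell, so "exit" only ends the check, not the calling shell.
check_engine_env() (
  kind="${DOCLING_SERVE_ENG_KIND:-local}"   # local is the documented default
  case "$kind" in
    local)
      echo "ok: local engine, no external services needed"
      ;;
    rq)
      if [ -z "${DOCLING_SERVE_ENG_RQ_REDIS_URL:-}" ]; then
        echo "error: DOCLING_SERVE_ENG_RQ_REDIS_URL must be set when DOCLING_SERVE_ENG_KIND=rq" >&2
        exit 1
      fi
      echo "ok: rq engine using ${DOCLING_SERVE_ENG_RQ_REDIS_URL}"
      ;;
    *)
      echo "note: engine '$kind' is not covered by this check"
      ;;
  esac
)

# Usage (run the same check in front of both the API and the workers):
#   check_engine_env && docling-serve run
#   check_engine_env && docling-serve rq-worker
```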

Cluster, production & advanced variants

These live in the docling-serve repo (run-time manifests aren't vendored here):

OpenShift — simple deployment
Multi-worker RQ on Kubernetes (Redis + worker pods + secret)
Secure deployment with oauth-proxy / TLS / Route
ReplicaSets with sticky sessions (task state is node-local)
Model-cache PVC/Job (pre-baking weights)
KFP / Ray engines, OpenTelemetry, CUDA image-tagging policy → serve repo

Prefer not to run any of this yourself? See the managed service.