⚡ RTX GPU Acceleration
Whether you're an AI enthusiast, researcher, or developer working with document processing, this guide will help you unlock the full potential of your NVIDIA RTX GPU with Docling.
By leveraging GPU acceleration, you can achieve up to 6x speedup compared to CPU-only processing. This dramatic performance improvement makes GPU acceleration especially valuable for processing large batches of documents, handling high-throughput document conversion workflows, or experimenting with advanced document understanding models.
Prerequisites
Before setting up GPU acceleration, ensure you have:
- An NVIDIA RTX GPU (RTX 40/50 series)
- Windows 10/11 or Linux operating system
Installation Steps
1. Install NVIDIA GPU Drivers
First, ensure you have the latest NVIDIA GPU drivers installed:
- Windows: Download from NVIDIA Driver Downloads
- Linux: Use your distribution's package manager or download from NVIDIA
Verify the installation:
nvidia-smi
This command should display your GPU information and driver version.
2. Install CUDA Toolkit
CUDA is NVIDIA's parallel computing platform required for GPU acceleration.
Follow the official installation guide for your operating system at NVIDIA CUDA Downloads. The installer will guide you through the process and automatically set up the required environment variables.
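After the installer finishes, you can confirm that the CUDA compiler is on your PATH (assuming the installer added the toolkit's bin directory, which it does by default):
nvcc --version
This should print the installed CUDA toolkit version.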
3. Install cuDNN
cuDNN provides optimized implementations for deep learning operations.
Follow the official installation guide at NVIDIA cuDNN Downloads. The guide provides detailed instructions for all supported platforms.
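There is no dedicated command-line check for cuDNN, but once PyTorch is installed (step 4) you can confirm that it sees a working cuDNN build:
import torch
print(f"cuDNN available: {torch.backends.cudnn.is_available()}")
print(f"cuDNN version: {torch.backends.cudnn.version()}")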
4. Install PyTorch with CUDA Support
To use GPU acceleration with Docling, you need to install PyTorch with CUDA support by pointing pip at the matching --index-url:
# For CUDA 12.8 (current default for PyTorch)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
# For CUDA 13.0
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
Note
The --index-url parameter is crucial as it ensures you get the CUDA-enabled version of PyTorch instead of the CPU-only version.
For other CUDA versions and installation options, refer to the PyTorch Installation Matrix.
Verify PyTorch CUDA installation:
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
print(f"GPU device: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None'}")
5. Install and Run Docling
Install Docling with all dependencies:
pip install docling
That's it! Docling will automatically detect and use your RTX GPU when available. No additional configuration is required for basic usage.
from docling.document_converter import DocumentConverter
# Docling automatically uses GPU when available
converter = DocumentConverter()
result = converter.convert("document.pdf")
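The returned result wraps a DoclingDocument, which you can export directly, for example to Markdown:
print(result.document.export_to_markdown())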
Advanced: Tuning GPU Performance
For optimal GPU performance with large document batches, you can adjust batch sizes and explicitly configure the accelerator:
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import ThreadedPdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.threaded_standard_pdf_pipeline import ThreadedStandardPdfPipeline
# Explicitly configure GPU acceleration
accelerator_options = AcceleratorOptions(
    device=AcceleratorDevice.CUDA,  # Use CUDA for NVIDIA GPUs
)
# Configure the threaded PDF pipeline for optimal GPU performance
pipeline_options = ThreadedPdfPipelineOptions(
    accelerator_options=accelerator_options,
    ocr_batch_size=64,     # Increase batch size for GPU
    layout_batch_size=64,  # Increase batch size for GPU
    table_batch_size=4,
)
# Create a converter that runs PDFs through the threaded pipeline
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=ThreadedStandardPdfPipeline,
            pipeline_options=pipeline_options,
        )
    }
)
# Convert documents
result = converter.convert("document.pdf")
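To check whether a batch-size setting actually pays off on your hardware, a simple timing loop over a folder of PDFs can compare configurations using the converter defined above (the docs/ path below is just a placeholder):
import time
from pathlib import Path
pdf_paths = sorted(Path("docs/").glob("*.pdf"))  # placeholder folder of test PDFs
start = time.perf_counter()
results = list(converter.convert_all(pdf_paths))  # reuses the GPU-configured converter
elapsed = time.perf_counter() - start
print(f"Converted {len(results)} documents in {elapsed:.1f} s ({len(results) / elapsed:.2f} docs/s)")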
GPU-Accelerated VLM Pipeline
For maximum performance with Vision Language Models (VLM), you can run a local inference server on your RTX GPU. This approach provides significantly better throughput than inline VLM processing.
Linux: Using vLLM (Recommended)
vLLM provides the best performance for GPU-accelerated VLM inference. Start the vLLM server with optimized parameters:
vllm serve ibm-granite/granite-docling-258M \
--host 127.0.0.1 --port 8000 \
--max-num-seqs 512 \
--max-num-batched-tokens 8192 \
--enable-chunked-prefill \
--gpu-memory-utilization 0.9
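Once the server logs that it is ready, you can sanity-check the OpenAI-compatible endpoint before pointing Docling at it (127.0.0.1:8000 matches the flags above):
curl http://127.0.0.1:8000/v1/models
The response should list ibm-granite/granite-docling-258M among the served models.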
Windows: Using llama-server
On Windows, you can use llama-server from llama.cpp for GPU-accelerated VLM inference:
Installation
- Download the latest llama.cpp release from the GitHub releases page
- Extract the archive and locate llama-server.exe
Launch Command
llama-server.exe `
--hf-repo ibm-granite/granite-docling-258M-GGUF `
-cb `
-ngl -1 `
--port 8000 `
--context-shift `
-np 16 -c 131072
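As with vLLM, you can confirm the server is responding before configuring Docling; llama-server exposes a health endpoint and the same OpenAI-compatible model listing (use curl.exe in PowerShell to avoid the Invoke-WebRequest alias):
curl.exe http://127.0.0.1:8000/health
curl.exe http://127.0.0.1:8000/v1/models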
Performance Comparison
vLLM delivers approximately 4x better performance than llama-server. For Windows users seeking maximum performance, consider running vLLM via WSL2 (Windows Subsystem for Linux). See vLLM on RTX 5090 via Docker for detailed WSL2 setup instructions.
Configure Docling for VLM Server
Once your inference server is running, configure Docling to use it:
from docling.datamodel import vlm_model_specs
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.datamodel.settings import settings
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline
BATCH_SIZE = 64
# Configure VLM options for the vLLM server
vlm_options = vlm_model_specs.GRANITEDOCLING_VLLM_API
vlm_options.concurrency = BATCH_SIZE
# When running with llama.cpp (llama-server), use the GGUF model name instead:
# vlm_options.params["model"] = "ibm-granite_granite-docling-258M-GGUF_granite-docling-258M-BF16.gguf"
# Set page batch size to match or exceed concurrency
settings.perf.page_batch_size = BATCH_SIZE
# Create converter with the VLM pipeline pointing at the local server
pipeline_options = VlmPipelineOptions(
    enable_remote_services=True,  # required when calling a remote/API inference endpoint
    vlm_options=vlm_options,
)
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_cls=VlmPipeline, pipeline_options=pipeline_options)
    }
)
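With the server running and the converter configured this way, conversion itself is unchanged:
result = converter.convert("document.pdf")
print(result.document.export_to_markdown())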
For more details on VLM pipeline configuration, see the GPU Support Guide.
Performance Optimization Tips
Batch Size Tuning
Adjust batch sizes based on your GPU memory:
- RTX 5090 (32GB): Use batch sizes of 64-128
- RTX 4090 (24GB): Use batch sizes of 32-64
- RTX 5070 (12GB): Use batch sizes of 16-32
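If you prefer to pick a starting point programmatically, a small heuristic along the lines of the ranges above can derive a batch size from the GPU memory PyTorch reports; the thresholds here are illustrative, not part of Docling:
import torch
def suggest_batch_size() -> int:
    # Rough starting batch size from total GPU memory (illustrative thresholds)
    if not torch.cuda.is_available():
        return 4  # conservative CPU fallback
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if total_gb >= 30:
        return 128
    if total_gb >= 20:
        return 64
    return 32 if total_gb >= 12 else 16
print(f"Suggested batch size: {suggest_batch_size()}")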
Memory Management
Monitor GPU memory usage:
import torch
# Check GPU memory
if torch.cuda.is_available():
    print(f"GPU Memory allocated: {torch.cuda.memory_allocated(0) / 1024**3:.2f} GB")
    print(f"GPU Memory reserved: {torch.cuda.memory_reserved(0) / 1024**3:.2f} GB")
Troubleshooting
CUDA Out of Memory
If you encounter out-of-memory errors:
- Reduce batch sizes in pipeline_options
- Process fewer documents concurrently
- Clear GPU cache between batches:
import torch
torch.cuda.empty_cache()
CUDA Not Available
If torch.cuda.is_available() returns False:
- Verify NVIDIA drivers are installed: nvidia-smi
- Check CUDA installation: nvcc --version
- Reinstall PyTorch with the correct CUDA version
- Ensure your GPU is CUDA-compatible
Performance Not Improving
If GPU acceleration doesn't improve performance:
- Increase batch sizes (if memory allows)
- Ensure you're processing enough documents to benefit from GPU parallelization
- Check GPU utilization: nvidia-smi -l 1
- Verify PyTorch is using the GPU: torch.cuda.is_available()