Input Formats¶
Docling Graph uses a unified ingestion path: all inputs are converted through Docling, except DoclingDocument JSON, which skips conversion. See Docling supported formats for what Docling accepts.
Input Normalization Process¶
The pipeline automatically detects and validates input types, routing them through the appropriate processing stages:
%%{init: {'theme': 'redux-dark', 'look': 'default', 'layout': 'elk'}}%%
flowchart TD
%% 1. Define Classes
classDef input fill:#E3F2FD,stroke:#90CAF9,color:#0D47A1
classDef config fill:#FFF8E1,stroke:#FFECB3,color:#5D4037
classDef output fill:#E8F5E9,stroke:#A5D6A7,color:#1B5E20
classDef decision fill:#FFE0B2,stroke:#FFB74D,color:#E65100
classDef data fill:#EDE7F6,stroke:#B39DDB,color:#4527A0
classDef operator fill:#F3E5F5,stroke:#CE93D8,color:#6A1B9A
classDef process fill:#ECEFF1,stroke:#B0BEC5,color:#263238
%% 2. Define Nodes
Start@{ shape: terminal, label: "Input Source" }
Detect@{ shape: procs, label: "Input Type Detection" }
%% Validators
ValPDF@{ shape: lin-proc, label: "Validate PDF" }
ValImg@{ shape: lin-proc, label: "Validate Image" }
ValText@{ shape: lin-proc, label: "Validate Text" }
ValMD@{ shape: lin-proc, label: "Validate MD" }
ValDoc@{ shape: lin-proc, label: "Validate Docling" }
%% URL Specifics
ValURL@{ shape: lin-proc, label: "Validate & Download URL" }
CheckDL{"Type?"}
%% Handlers
HandVisual@{ shape: tag-proc, label: "Visual Handler" }
HandText@{ shape: tag-proc, label: "Text Handler" }
HandDoc@{ shape: tag-proc, label: "Object Handler" }
%% Outcomes
SetFlags@{ shape: procs, label: "Set Processing Flags" }
Output@{ shape: doc, label: "Normalized Context" }
%% 3. Define Connections
Start --> Detect
%% Input Detection Routing
Detect -- PDF --> ValPDF
Detect -- Image --> ValImg
Detect -- Text --> ValText
Detect -- MD --> ValMD
Detect -- Docling --> ValDoc
Detect -- URL --> ValURL
%% URL Routing (Feeds back into validators)
ValURL --> CheckDL
CheckDL -- PDF --> ValPDF
CheckDL -- Image --> ValImg
CheckDL -- Text --> ValText
CheckDL -- MD --> ValMD
%% Validation to Handlers (The "Happy Path")
ValPDF & ValImg --> HandVisual
ValText & ValMD --> HandText
ValDoc --> HandDoc
%% Converge Handlers to Output
HandVisual & HandText & HandDoc --> SetFlags --> Output
%% 4. Apply Classes
class Start input
class Detect,SetFlags process
class ValPDF,ValImg,ValText,ValMD,ValURL,ValDoc process
class HandVisual,HandText,HandDoc operator
class CheckDL decision
class Output output
Key behavior:
- DoclingDocument JSON: Loaded directly; conversion is skipped.
- All other inputs: Normalized (e.g. URL download, text to temp .md), then sent to Docling. Docling validates the format; unsupported types raise Docling errors.
- URLs: Downloaded to a temp file; the path is passed to Docling.
Supported Input Formats¶
Docling Graph does not whitelist extensions. Any file or URL is sent to Docling; Docling supported formats include PDF, Office (DOCX, XLSX, PPTX), images, HTML, Markdown, LaTeX, AsciiDoc, CSV. Unsupported formats produce a Docling conversion error (e.g. ExtractionError: Conversion failed in Docling: ...).
Document inputs (files, raw text)¶
Any Docling-supported file, or raw text (API only). Text and .txt are normalized to markdown, then sent to Docling.
CLI: docling-graph convert document.pdf -t templates.billing_document.BillingDocument
API: Same; for raw text use source="text content" and run_pipeline(config, mode="api").
URLs¶
Description: Download and process documents from HTTP/HTTPS URLs.
Processing: Content is downloaded to a temporary file; the path is passed to Docling. Supported formats are those Docling supports.
Requirements: Valid http/https URL; file size under limit (default: 100MB).
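The size requirement can be enforced while the download is streamed to disk, so an oversized file is rejected before it is fully written. The following is a minimal standard-library sketch of that idea, not docling-graph's implementation: `save_with_size_limit` and `MAX_SIZE_MB` are hypothetical names, and the 100 MB value only mirrors the documented default.

```python
import os
import tempfile
from typing import BinaryIO

MAX_SIZE_MB = 100  # assumption: mirrors the documented default limit


def save_with_size_limit(stream: BinaryIO, suffix: str, max_size_mb: float = MAX_SIZE_MB) -> str:
    """Copy a byte stream to a temp file, aborting once it exceeds max_size_mb."""
    max_bytes = int(max_size_mb * 1024 * 1024)
    written = 0
    tmp = tempfile.NamedTemporaryFile(delete=False, suffix=suffix)
    try:
        with tmp:
            # Read in 64 KiB chunks so the whole payload never sits in memory.
            while chunk := stream.read(64 * 1024):
                written += len(chunk)
                if written > max_bytes:
                    raise ValueError(f"Download exceeds {max_size_mb} MB limit")
                tmp.write(chunk)
    except Exception:
        # Remove the partial file before re-raising.
        os.unlink(tmp.name)
        raise
    return tmp.name
```

In practice the stream could be the response object returned by `urllib.request.urlopen(url, timeout=30)`; the returned path is what would then be passed on for conversion.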
CLI Example:
# PDF from URL
docling-graph convert https://example.com/invoice.pdf -t templates.billing_document.BillingDocument
# Image from URL
docling-graph convert https://example.com/scan.jpg -t templates.form.Form
# Text from URL
docling-graph convert https://example.com/notes.txt -t templates.report.Report --backend llm
Python API Example:
from docling_graph import PipelineConfig, run_pipeline

config = PipelineConfig(
    source="https://example.com/document.pdf",
    template="templates.billing_document.BillingDocument",
    backend="llm",
    inference="remote",
    processing_mode="many-to-one",
    output_dir="outputs",
    export_format="csv"
)
run_pipeline(config)
URL Configuration:
from docling_graph.core.input.handlers import URLInputHandler
# Custom timeout and size limit
handler = URLInputHandler(
    timeout=30,  # seconds
    max_size_mb=50  # megabytes
)
Plain text strings (Python API only)¶
Raw text: pass a string as source and call run_pipeline(config, mode="api"). It is normalized to a temporary markdown file and sent to Docling. CLI does not accept plain text (file path or URL only).
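The normalization step described above (raw string to temporary markdown file) can be pictured with the standard library alone. This is an illustrative sketch under assumptions, not the library's code: `text_to_temp_markdown` is a hypothetical helper, and a plain `ValueError` stands in for the library's own "Text input is empty" error.

```python
import tempfile
from pathlib import Path


def text_to_temp_markdown(text: str) -> Path:
    """Write raw text to a temporary .md file and return its path."""
    if not text.strip():
        # Stand-in for the documented "Text input is empty" failure.
        raise ValueError("Text input is empty")
    # delete=False keeps the file on disk so a later stage can open it by path.
    with tempfile.NamedTemporaryFile("w", suffix=".md", delete=False, encoding="utf-8") as tmp:
        tmp.write(text)
    return Path(tmp.name)
```

The caller is responsible for deleting the file once processing finishes, for example with `path.unlink()`.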
DoclingDocument JSON (skip conversion)¶
Description: Pre-processed DoclingDocument JSON files.
File Extensions: .json (with DoclingDocument schema)
Processing: Skips document conversion. Uses pre-existing structure.
Use Cases:
- Reprocessing previously converted documents
- Custom document preprocessing pipelines
- Integration with external Docling workflows
Requirements:
- Valid DoclingDocument JSON schema
- Must include schema_name: "DoclingDocument"
- Must include version field
CLI Example:
docling-graph convert preprocessed.json -t templates.custom.Custom
Python API Example:
from docling_graph import PipelineConfig, run_pipeline

config = PipelineConfig(
    source="preprocessed.json",
    template="templates.custom.Custom",
    backend="llm",
    inference="local",
    processing_mode="many-to-one",
    output_dir="outputs",
    export_format="csv"
)
run_pipeline(config)
DoclingDocument JSON Structure:
{
  "schema_name": "DoclingDocument",
  "version": "1.0.0",
  "name": "document_name",
  "pages": {
    "0": {
      "page_no": 0,
      "size": {"width": 612, "height": 792}
    }
  },
  "body": {
    "self_ref": "#/body",
    "children": []
  },
  "furniture": {}
}
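Before passing a .json file to the pipeline, the two required markers listed under Requirements can be checked by hand. The following is a minimal sketch, not docling-graph's own validator: `looks_like_docling_document` is a hypothetical helper that inspects only `schema_name` and `version`, not the full schema.

```python
import json
from typing import Any


def looks_like_docling_document(data: Any) -> bool:
    """Return True if data carries the two required DoclingDocument markers."""
    return (
        isinstance(data, dict)
        and data.get("schema_name") == "DoclingDocument"
        and isinstance(data.get("version"), str)
        and bool(data["version"])
    )


# Usage: parse the JSON text first, then check the resulting object.
sample = json.loads('{"schema_name": "DoclingDocument", "version": "1.0.0"}')
is_docling = looks_like_docling_document(sample)
```

A file that fails this check would be treated as an ordinary document rather than loaded directly.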
Input Format Detection¶
- URL: String starting with http:// or https://.
- DoclingDocument: .json file with DoclingDocument schema (e.g. schema_name, version, pages).
- Document: Everything else (any file path or, in API mode, raw text). Passed to Docling; no extension whitelist in docling-graph.
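The three detection rules above can be sketched as one small function. This is an illustration under assumptions rather than the library's detector: `detect_input_type` and its return labels are hypothetical, and the DoclingDocument check looks only at `schema_name`.

```python
import json
from pathlib import Path


def detect_input_type(source: str) -> str:
    """Classify a source string as 'url', 'docling_document', or 'document'."""
    # Rule 1: anything starting with an http(s) scheme is treated as a URL.
    if source.lower().startswith(("http://", "https://")):
        return "url"

    # Rule 2: a .json file whose top-level object carries the DoclingDocument marker.
    # Multi-line or very long strings are raw text, so skip the filesystem check for them.
    if "\n" not in source and len(source) < 1024 and source.lower().endswith(".json"):
        path = Path(source)
        if path.is_file():
            try:
                data = json.loads(path.read_text(encoding="utf-8"))
            except (json.JSONDecodeError, UnicodeDecodeError):
                data = None
            if isinstance(data, dict) and data.get("schema_name") == "DoclingDocument":
                return "docling_document"

    # Rule 3: everything else (file path or raw text) goes to Docling.
    return "document"
```

Because there is no extension whitelist, the fallback branch never rejects an input; unsupported files only fail later, during Docling conversion.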
Processing Pipeline by Input Type¶
All inputs except DoclingDocument¶
Input → Normalize (e.g. URL download, text → .md) → Docling conversion →
DoclingDocument → Chunking → Extraction → Graph → Export
DoclingDocument JSON¶
DoclingDocument JSON → Load (conversion skipped) → Chunking → Extraction → Graph → Export
Backend Compatibility¶
| Input type | LLM Backend | VLM Backend |
|---|---|---|
| Documents (files, URLs) | Yes | Yes (PDF/images at Docling level) |
| DoclingDocument JSON | Yes | Yes |
| Plain text (API) | Yes | Converted via Docling |
VLM backend only supports certain inputs at the Docling level (e.g. PDF, images). Other formats may raise Docling or backend errors.
Error Handling¶
Unsupported format (from Docling)¶
When the file type is not supported by Docling:
ExtractionError: Conversion failed in Docling: ...
Use a Docling-supported format or convert the file first.
Empty text¶
ValidationError: Text input is empty
Ensure the content is non-empty.
File not found (CLI)¶
ConfigurationError: File not found
Use a valid file path or URL.
Invalid URL¶
ValidationError: URL must use http or https scheme
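A scheme check of this kind can be reproduced with the standard library when URLs need to be screened before they reach the pipeline. This is a hedged sketch, not the library's validator: `check_url` is a hypothetical helper, and it raises a plain `ValueError` in place of the library's `ValidationError`.

```python
from urllib.parse import urlparse


def check_url(url: str) -> None:
    """Raise ValueError unless url is an absolute http(s) URL with a host."""
    parsed = urlparse(url)
    # urlparse lowercases the scheme, so "HTTPS://..." is accepted as well.
    if parsed.scheme not in ("http", "https"):
        raise ValueError("URL must use http or https scheme")
    if not parsed.netloc:
        raise ValueError("URL has no host")
```

For example, `check_url("ftp://example.com/file.pdf")` raises, while `check_url("https://example.com/file.pdf")` returns without error.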
Best Practices¶
👍 Choose the Right Backend¶
- PDFs and Images: Use VLM for complex layouts, LLM for text-heavy documents
- Text Files: Always use LLM backend
- Mixed Workflows: Use LLM backend for maximum compatibility
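The guidance above can be condensed into a small selection helper. This is only a sketch of the rule of thumb: `suggest_backend` and `VISUAL_SUFFIXES` are hypothetical, and the suffix set is an illustrative assumption rather than the list of formats the VLM backend accepts.

```python
from pathlib import Path

# Assumption: an illustrative set of PDF and image suffixes, not an official list.
VISUAL_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp", ".webp"}


def suggest_backend(source: str, complex_layout: bool = False) -> str:
    """Return 'vlm' for PDFs and images with complex layouts, otherwise 'llm'."""
    suffix = Path(source).suffix.lower()
    if suffix in VISUAL_SUFFIXES and complex_layout:
        return "vlm"
    # Text files, text-heavy PDFs, and mixed workflows default to the LLM backend.
    return "llm"
```

The result can be passed as the `backend` value when building a configuration; whether a layout counts as complex remains the caller's judgment.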
👍 Validate Input Files¶
from pathlib import Path

source_path = Path("document.txt")
if not source_path.exists():
    raise FileNotFoundError(f"Input file not found: {source_path}")
if source_path.stat().st_size == 0:
    raise ValueError("Input file is empty")
👍 Handle URLs Safely¶
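from docling_graph.core.input.validators import URLValidator
from docling_graph.exceptions import ValidationError  # assumed import path; adjust to your installed version

url = "https://example.com/document.pdf"
validator = URLValidator()
try:
    validator.validate(url)
except ValidationError as e:
    print(f"Invalid URL: {e.message}")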
👍 Use Appropriate Processing Modes¶
- one-to-one: Best for multi-page PDFs where each page is independent
- many-to-one: Best for text files and single-entity documents
Troubleshooting¶
🐛 Plain text input is only supported via Python API¶
Cause: Trying to pass plain text string via CLI
Solution: Use Python API or save text to a .txt file first
from pathlib import Path

# Option 1: Use Python API
run_pipeline(config, mode="api")
# Option 2: Save to file
Path("temp.txt").write_text(text_content)
config.source = "temp.txt"
run_pipeline(config, mode="cli")
🐛 VLM backend does not support text-only inputs¶
Cause: Using VLM backend with text files
Solution: Switch to LLM backend
🐛 URL download timeout¶
Cause: Slow network or large file
Solution: Increase timeout or download manually
from docling_graph.core.input.handlers import URLInputHandler
handler = URLInputHandler(timeout=60) # 60 seconds
temp_path = handler.load(url)
Next Steps¶
- Backend Selection - Choose the right backend for your input
- Processing Modes - Understand one-to-one vs many-to-one
- Configuration Examples - See complete configuration examples