Input Formats

Docling Graph uses a unified ingestion path: every input is converted through Docling except DoclingDocument JSON, which skips conversion. See Docling supported formats for what Docling accepts.

Input Normalization Process

The pipeline automatically detects and validates input types, routing them through the appropriate processing stages:

%%{init: {'theme': 'redux-dark', 'look': 'default', 'layout': 'elk'}}%%
flowchart TD
    %% 1. Define Classes
    classDef input fill:#E3F2FD,stroke:#90CAF9,color:#0D47A1
    classDef config fill:#FFF8E1,stroke:#FFECB3,color:#5D4037
    classDef output fill:#E8F5E9,stroke:#A5D6A7,color:#1B5E20
    classDef decision fill:#FFE0B2,stroke:#FFB74D,color:#E65100
    classDef data fill:#EDE7F6,stroke:#B39DDB,color:#4527A0
    classDef operator fill:#F3E5F5,stroke:#CE93D8,color:#6A1B9A
    classDef process fill:#ECEFF1,stroke:#B0BEC5,color:#263238

    %% 2. Define Nodes
    Start@{ shape: terminal, label: "Input Source" }
    Detect@{ shape: procs, label: "Input Type Detection" }

    %% Validators
    ValPDF@{ shape: lin-proc, label: "Validate PDF" }
    ValImg@{ shape: lin-proc, label: "Validate Image" }
    ValText@{ shape: lin-proc, label: "Validate Text" }
    ValMD@{ shape: lin-proc, label: "Validate MD" }
    ValDoc@{ shape: lin-proc, label: "Validate Docling" }

    %% URL Specifics
    ValURL@{ shape: lin-proc, label: "Validate & Download URL" }
    CheckDL{"Type?"}

    %% Handlers
    HandVisual@{ shape: tag-proc, label: "Visual Handler" }
    HandText@{ shape: tag-proc, label: "Text Handler" }
    HandDoc@{ shape: tag-proc, label: "Object Handler" }

    %% Outcomes
    SetFlags@{ shape: procs, label: "Set Processing Flags" }
    Output@{ shape: doc, label: "Normalized Context" }

    %% 3. Define Connections
    Start --> Detect

    %% Input Detection Routing
    Detect -- PDF --> ValPDF
    Detect -- Image --> ValImg
    Detect -- Text --> ValText
    Detect -- MD --> ValMD
    Detect -- Docling --> ValDoc
    Detect -- URL --> ValURL

    %% URL Routing (Feeds back into validators)
    ValURL --> CheckDL
    CheckDL -- PDF --> ValPDF
    CheckDL -- Image --> ValImg
    CheckDL -- Text --> ValText
    CheckDL -- MD --> ValMD

    %% Validation to Handlers (The "Happy Path")
    ValPDF & ValImg --> HandVisual
    ValText & ValMD --> HandText
    ValDoc --> HandDoc

    %% Converge Handlers to Output
    HandVisual & HandText & HandDoc --> SetFlags --> Output

    %% 4. Apply Classes
    class Start input
    class Detect,SetFlags process
    class ValPDF,ValImg,ValText,ValMD,ValURL,ValDoc process
    class HandVisual,HandText,HandDoc operator
    class CheckDL decision
    class Output output

Key behavior:

  • DoclingDocument JSON: Loaded directly; conversion is skipped.
  • All other inputs: Normalized (e.g. URL download, text to temp .md), then sent to Docling. Docling validates the format; unsupported types raise Docling errors.
  • URLs: Downloaded to a temp file; the path is passed to Docling.

Supported Input Formats

Docling Graph does not whitelist extensions. Any file or URL is sent to Docling, whose supported formats include PDF, Office (DOCX, XLSX, PPTX), images, HTML, Markdown, LaTeX, AsciiDoc, and CSV. Unsupported formats produce a Docling conversion error (e.g. ExtractionError: Conversion failed in Docling: ...).


Document inputs (files, raw text)

Any Docling-supported file, or raw text (API only). Text and .txt are normalized to markdown, then sent to Docling.

CLI: docling-graph convert document.pdf -t templates.billing_document.BillingDocument
API: Same; for raw text use source="text content" and run_pipeline(config, mode="api").


URLs

Description: Download and process documents from HTTP/HTTPS URLs.

Processing: Content is downloaded to a temporary file; the path is passed to Docling. Supported formats are those Docling supports.

Requirements: Valid http/https URL; file size under limit (default: 100MB).
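
The two requirements above can be checked before any download is attempted. The following is a minimal sketch using only the standard library; `check_url` and its parameters are hypothetical names for illustration, not part of docling-graph, and treating 1 MB as 1024 × 1024 bytes is an assumption.

```python
from typing import Optional
from urllib.parse import urlparse


def check_url(url: str, content_length: Optional[int] = None,
              max_size_mb: int = 100) -> None:
    """Raise ValueError if the URL scheme or the reported size is unacceptable.

    Hypothetical helper; docling-graph performs its own validation internally.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError("URL must use http or https scheme")
    if not parsed.netloc:
        raise ValueError("URL has no host")
    # content_length is the byte count from a Content-Length header, if known.
    # Assumption: 1 MB is counted as 1024 * 1024 bytes.
    if content_length is not None and content_length > max_size_mb * 1024 * 1024:
        raise ValueError(f"File exceeds the {max_size_mb} MB limit")
```

Passing `content_length=None` skips the size check, which matches servers that do not report a size up front; in that case the limit can only be enforced while downloading.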

CLI Example:

# PDF from URL
docling-graph convert https://example.com/invoice.pdf -t templates.billing_document.BillingDocument

# Image from URL
docling-graph convert https://example.com/scan.jpg -t templates.form.Form

# Text from URL
docling-graph convert https://example.com/notes.txt -t templates.report.Report --backend llm

Python API Example:

from docling_graph import PipelineConfig, run_pipeline

config = PipelineConfig(
    source="https://example.com/document.pdf",
    template="templates.billing_document.BillingDocument",
    backend="llm",
    inference="remote",
    processing_mode="many-to-one",
    output_dir="outputs",
    export_format="csv"
)

run_pipeline(config)

URL Configuration:

from docling_graph.core.input.handlers import URLInputHandler

# Custom timeout and size limit
handler = URLInputHandler(
    timeout=30,      # seconds
    max_size_mb=50   # megabytes
)


Plain text strings (Python API only)

Raw text: pass a string as source and call run_pipeline(config, mode="api"). It is normalized to a temporary markdown file and sent to Docling. CLI does not accept plain text (file path or URL only).
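
To make the normalization step concrete, here is a rough sketch of writing a raw string to a temporary markdown file. It uses only the standard library; `text_to_temp_markdown` is a hypothetical name, and the actual docling-graph implementation may differ.

```python
import tempfile
from pathlib import Path


def text_to_temp_markdown(text: str) -> Path:
    """Write raw text to a temporary .md file and return its path.

    Hypothetical helper illustrating the normalization described above.
    """
    if not text.strip():
        # Empty or whitespace-only input is rejected before any file is created.
        raise ValueError("Text input is empty")
    # delete=False keeps the file on disk so a converter can open it by path.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".md", delete=False, encoding="utf-8"
    ) as handle:
        handle.write(text)
    return Path(handle.name)
```

The caller is responsible for deleting the file once conversion has finished, for example with `path.unlink()`.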


DoclingDocument JSON (skip conversion)

Description: Pre-processed DoclingDocument JSON files.

File Extensions: .json (with DoclingDocument schema)

Processing: Skips document conversion. Uses pre-existing structure.

Use Cases:

  • Reprocessing previously converted documents
  • Custom document preprocessing pipelines
  • Integration with external Docling workflows

Requirements:

  • Valid DoclingDocument JSON schema
  • Must include schema_name: "DoclingDocument"
  • Must include version field
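
A quick pre-flight check for the two explicit field requirements might look like the sketch below. `docling_json_problems` is a hypothetical helper; it does not validate the full DoclingDocument schema, only the `schema_name` and `version` fields named above.

```python
import json
from typing import List


def docling_json_problems(raw: str) -> List[str]:
    """Return a list of problems; an empty list means the basic fields are present.

    Hypothetical helper: checks only schema_name and version, not the full schema.
    """
    try:
        data = json.loads(raw)
    except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
        return [f"not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]
    problems = []
    if data.get("schema_name") != "DoclingDocument":
        problems.append('schema_name must be "DoclingDocument"')
    if "version" not in data:
        problems.append("version field is missing")
    return problems
```

Running this on a file's contents before invoking the pipeline gives an early, readable message instead of a later loading failure.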

CLI Example:

docling-graph convert processed_document.json -t templates.custom.Custom

Python API Example:

from docling_graph import PipelineConfig, run_pipeline

config = PipelineConfig(
    source="preprocessed.json",
    template="templates.custom.Custom",
    backend="llm",
    inference="local",
    processing_mode="many-to-one",
    output_dir="outputs",
    export_format="csv"
)

run_pipeline(config)

DoclingDocument JSON Structure:

{
  "schema_name": "DoclingDocument",
  "version": "1.0.0",
  "name": "document_name",
  "pages": {
    "0": {
      "page_no": 0,
      "size": {"width": 612, "height": 792}
    }
  },
  "body": {
    "self_ref": "#/body",
    "children": []
  },
  "furniture": {}
}


Input Format Detection

  • URL: String starting with http:// or https://.
  • DoclingDocument: .json file with DoclingDocument schema (e.g. schema_name, version, pages).
  • Document: Everything else (any file path or, in API mode, raw text). Passed to Docling; no extension whitelist in docling-graph.
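
The three detection rules can be sketched as a single function. This is an illustration under stated assumptions, not the library's detector: `detect_input_kind` and its return values are hypothetical, and only `schema_name` is inspected to recognize a DoclingDocument.

```python
import json
from pathlib import Path


def detect_input_kind(source: str) -> str:
    """Classify a source string as 'url', 'docling_document', or 'document'.

    Hypothetical sketch of the rules listed above.
    """
    if source.startswith(("http://", "https://")):
        return "url"

    path = Path(source)
    if path.suffix.lower() == ".json":
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            # Missing, unreadable, or malformed file: treat as an ordinary document.
            data = None
        # Only JSON that declares the DoclingDocument schema skips conversion.
        if isinstance(data, dict) and data.get("schema_name") == "DoclingDocument":
            return "docling_document"

    # Everything else (file paths, or raw text in API mode) is passed to Docling.
    return "document"
```

Note that a `.json` file without the DoclingDocument marker falls through to the generic document path, where Docling decides whether it can convert it.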

Processing Pipeline by Input Type

All inputs except DoclingDocument

Input → Normalize (e.g. URL download, text → .md) → Docling conversion →
DoclingDocument → Chunking → Extraction → Graph → Export

DoclingDocument JSON

Input → Load DoclingDocument → Chunking / Extraction → Graph → Export
(Conversion skipped)

Backend Compatibility

Input type | LLM Backend | VLM Backend
--- | --- | ---
Documents (files, URLs) | Yes | Yes (PDF/images at Docling level)
DoclingDocument JSON | Yes | Yes
Plain text (API) | Yes | Converted via Docling

VLM backend only supports certain inputs at the Docling level (e.g. PDF, images). Other formats may raise Docling or backend errors.


Error Handling

Unsupported format (from Docling)

When the file type is not supported by Docling:

ExtractionError: Conversion failed in Docling: ...
Details: source=/path/to/file.xyz
Use a Docling-supported format or convert the file first.

Empty text

ValidationError: Text input is empty — ensure content is non-empty.

File not found (CLI)

ConfigurationError: File not found — use a valid file path or URL.

Invalid URL

ValidationError: URL must use http or https scheme


Best Practices

👍 Choose the Right Backend

  • PDFs and Images: Use VLM for complex layouts, LLM for text-heavy documents
  • Text Files: Always use LLM backend
  • Mixed Workflows: Use LLM backend for maximum compatibility
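
As a rough illustration of the guidance above, a small helper could pick a backend from the file suffix. Everything here is an assumption for illustration: `suggest_backend`, the `complex_layout` flag, the suffix set, and the `"vlm"` return value are hypothetical and not part of docling-graph.

```python
from pathlib import Path

# Assumption: a small, non-exhaustive set of suffixes treated as visual inputs.
VISUAL_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg", ".tiff"}


def suggest_backend(source: str, complex_layout: bool = False) -> str:
    """Return 'vlm' or 'llm' following the best-practice guidance.

    Hypothetical helper; the caller decides whether the layout is complex.
    """
    suffix = Path(source).suffix.lower()
    # VLM is suggested only for visual inputs with complex layouts.
    if suffix in VISUAL_SUFFIXES and complex_layout:
        return "vlm"
    # Text files, text-heavy documents, and mixed workflows default to LLM.
    return "llm"
```

Defaulting to `"llm"` reflects the advice that the LLM backend offers the widest compatibility.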

👍 Validate Input Files

from pathlib import Path

source_path = Path("document.txt")
if not source_path.exists():
    raise FileNotFoundError(f"Input file not found: {source_path}")

if source_path.stat().st_size == 0:
    raise ValueError("Input file is empty")

👍 Handle URLs Safely

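from docling_graph.core.input.validators import URLValidator
from docling_graph.exceptions import ValidationError  # assumed import path; adjust if your version differs

url = "https://example.com/document.pdf"  # example URL

validator = URLValidator()
try:
    validator.validate(url)
except ValidationError as e:
    print(f"Invalid URL: {e.message}")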

👍 Use Appropriate Processing Modes

  • one-to-one: Best for multi-page PDFs where each page is independent
  • many-to-one: Best for text files and single-entity documents

Troubleshooting

🐛 Plain text input is only supported via Python API

Cause: Trying to pass plain text string via CLI

Solution: Use Python API or save text to a .txt file first

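from pathlib import Path

# config is a PipelineConfig; text_content is the raw text string.

# Option 1: Use Python API (config.source holds the raw text)
run_pipeline(config, mode="api")

# Option 2: Save to file
Path("temp.txt").write_text(text_content, encoding="utf-8")
config.source = "temp.txt"
run_pipeline(config, mode="cli")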

🐛 VLM backend does not support text-only inputs

Cause: Using VLM backend with text files

Solution: Switch to LLM backend

docling-graph convert notes.txt -t templates.Report --backend llm

🐛 URL download timeout

Cause: Slow network or large file

Solution: Increase timeout or download manually

from docling_graph.core.input.handlers import URLInputHandler

url = "https://example.com/document.pdf"  # example URL

handler = URLInputHandler(timeout=60)  # 60 seconds
temp_path = handler.load(url)

Next Steps