Extractors API

Overview

Document extraction strategies and backends.

Module: docling_graph.core.extractors

Recent Improvements

  • Model Capability Detection: Automatic tier detection and adaptive prompting
  • Chain of Density: Multi-turn consolidation for ADVANCED tier models
  • Zero Data Loss: Returns partial models instead of empty results on failures
  • Real Tokenizers: Accurate token counting with 20% safety margins
  • Enhanced GPU Cleanup: Better memory management for VLM backends

Extraction Strategies

OneToOne

Per-page extraction strategy.

class OneToOne(ExtractorProtocol):
    """Extract data from each page separately."""

    def __init__(self, backend: Backend):
        """Initialize with backend."""
        self.backend = backend

    def extract(
        self,
        source: str,
        template: Type[BaseModel]
    ) -> List[BaseModel]:
        """
        Extract from each page.

        Returns:
            List of models (one per page)
        """

Use Cases:

  • Multi-page documents with independent content
  • Page-level analysis
  • Parallel processing

Example:

from docling_graph.core.extractors import OneToOne
from docling_graph.core.extractors.backends import LLMBackend
from docling_graph.llm_clients import get_client
from docling_graph.llm_clients.config import resolve_effective_model_config

effective = resolve_effective_model_config("ollama", "llama3.1:8b")
client = get_client("ollama")(model_config=effective)
backend = LLMBackend(client=client, model="llama3.1:8b", provider="ollama")
extractor = OneToOne(backend=backend)

results = extractor.extract("document.pdf", MyTemplate)
print(f"Extracted {len(results)} pages")

ManyToOne

Consolidated extraction strategy with zero data loss.

class ManyToOne(ExtractorProtocol):
    """Extract and consolidate data from entire document."""

    def __init__(
        self,
        backend: Backend,
        use_chunking: bool = True,
    ):
        """Initialize with backend and options."""
        self.backend = backend
        self.use_chunking = use_chunking

    def extract(
        self,
        source: str,
        template: Type[BaseModel]
    ) -> List[BaseModel]:
        """
        Extract and consolidate.

        Returns:
            List with single consolidated model (success)
            or multiple partial models (merge failure - zero data loss)
        """

Use Cases:

  • Single entity across document
  • Consolidated information
  • Summary extraction

Features:

  • Zero Data Loss: Returns partial models if consolidation fails
  • Consolidation: Programmatic merge of chunk results
  • Schema-Aware Chunking: Dynamically adjusts chunk size based on schema

Example:

from docling_graph.core.extractors import ManyToOne
from docling_graph.core.extractors.backends import LLMBackend
from docling_graph.llm_clients import get_client
from docling_graph.llm_clients.config import resolve_effective_model_config

effective = resolve_effective_model_config("ollama", "llama3.1:8b")
client = get_client("ollama")(model_config=effective)
backend = LLMBackend(client=client, model="llama3.1:8b", provider="ollama")
extractor = ManyToOne(
    backend=backend,
    use_chunking=True,
)

results = extractor.extract("document.pdf", MyTemplate)

# Check if consolidation succeeded
if len(results) == 1:
    print(f"✅ Consolidated model: {results[0]}")
else:
    print(f"⚠ Got {len(results)} partial models (data preserved)")

Backends

LLMBackend

LLM-based extraction backend with adaptive prompting.

class LLMBackend(TextExtractionBackendProtocol):
    """LLM backend for text extraction."""

    def __init__(
        self,
        client: LLMClientProtocol,
        model: str,
        provider: str
    ):
        """Initialize LLM backend."""
        self.client = client
        self.model_capability = self._detect_capability()  # Auto-detect tier

Methods:

  • extract_from_markdown(markdown, template, context, is_partial) - Extract from markdown with adaptive prompting
  • consolidate_from_pydantic_models(raw_models, programmatic_model, template) - Consolidate models (uses Chain of Density for ADVANCED tier)
  • cleanup() - Clean up resources

Model Capability Tiers:

| Tier     | Model Size | Prompt Style | Consolidation              |
| -------- | ---------- | ------------ | -------------------------- |
| SIMPLE   | 1B-7B      | Minimal      | Single-turn                |
| STANDARD | 7B-13B     | Balanced     | Single-turn                |
| ADVANCED | 13B+       | Detailed     | Chain of Density (3 turns) |
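
The tier detector itself is internal (_detect_capability). Below is a minimal illustrative sketch, assuming detection keys off the parameter count embedded in the model name; the tier names and thresholds mirror the table above, everything else is hypothetical:

import re
from enum import Enum

class ModelCapability(Enum):
    SIMPLE = "simple"      # 1B-7B
    STANDARD = "standard"  # 7B-13B
    ADVANCED = "advanced"  # 13B+

def detect_capability(model_name: str) -> ModelCapability:
    """Illustrative only: parse the parameter count (e.g. '8b') from the
    model name and map it to the tiers in the table above."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*b", model_name.lower())
    if not match:
        return ModelCapability.STANDARD  # conservative default
    params = float(match.group(1))
    if params < 7:
        return ModelCapability.SIMPLE
    if params < 13:
        return ModelCapability.STANDARD
    return ModelCapability.ADVANCED

assert detect_capability("llama3.1:8b") is ModelCapability.STANDARD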

Example:

from docling_graph.core.extractors.backends import LLMBackend
from docling_graph.llm_clients import get_client
from docling_graph.llm_clients.config import resolve_effective_model_config

# STANDARD tier model (7B-13B)
effective = resolve_effective_model_config("ollama", "llama3.1:8b")
client = get_client("ollama")(model_config=effective)
backend = LLMBackend(client=client, model="llama3.1:8b", provider="ollama")

# Automatically uses STANDARD tier prompts
model = backend.extract_from_markdown(
    markdown=markdown,
    template=MyTemplate,
    context="full document",
    is_partial=False
)
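
consolidate_from_pydantic_models merges per-chunk results into a single model. A hedged usage sketch follows; raw_models and programmatic_baseline are placeholders for values your pipeline already holds (how the programmatic baseline is produced is pipeline-internal):

# Partial models from earlier chunked extraction (placeholders)
raw_models = [partial_a, partial_b]

consolidated = backend.consolidate_from_pydantic_models(
    raw_models=raw_models,
    programmatic_model=programmatic_baseline,
    template=MyTemplate,
)
# ADVANCED tier models consolidate via Chain of Density (3 turns);
# SIMPLE and STANDARD tiers use a single turn.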

VLMBackend

Vision-Language Model backend with enhanced GPU cleanup.

class VLMBackend(ExtractionBackendProtocol):
    """VLM backend for document extraction."""

    def __init__(self, model: str):
        """Initialize VLM backend."""
        self.model_name = model
        self.model = None  # Loaded on first use

Methods:

  • extract_from_document(source, template) - Extract from document
  • cleanup() - Enhanced GPU memory cleanup

Enhanced GPU Cleanup:

The cleanup() method now includes:

  • Model-to-CPU transfer before deletion
  • Explicit CUDA cache clearing
  • Memory usage tracking and logging
  • Multi-GPU device support
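
The exact implementation is internal; here is a minimal sketch of the pattern described above, assuming a PyTorch-backed model (every name below is illustrative):

import gc
import torch

def cleanup_vlm(backend) -> None:
    """Illustrative sketch of the cleanup steps listed above."""
    if backend.model is None:
        return
    before = torch.cuda.memory_allocated() if torch.cuda.is_available() else 0
    backend.model.to("cpu")   # model-to-CPU transfer before deletion
    backend.model = None
    gc.collect()
    if torch.cuda.is_available():
        for device in range(torch.cuda.device_count()):   # multi-GPU device support
            with torch.cuda.device(device):
                torch.cuda.empty_cache()                  # explicit CUDA cache clearing
        freed = before - torch.cuda.memory_allocated()
        print(f"cleanup: freed ~{freed / 1e6:.0f} MB")    # memory tracking and logging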

Example:

from docling_graph.core.extractors.backends import VLMBackend

backend = VLMBackend(model="numind/NuExtract-2.0-8B")

try:
    models = backend.extract_from_document("document.pdf", MyTemplate)
finally:
    backend.cleanup()  # Properly releases GPU memory

Document Processing

DocumentProcessor

Handles document conversion and markdown extraction.

class DocumentProcessor(DocumentProcessorProtocol):
    """Process documents with Docling."""

    def convert_to_docling_doc(self, source: str) -> Any:
        """Convert to Docling document."""

    def extract_full_markdown(self, document: Any) -> str:
        """Extract full markdown."""

    def extract_page_markdowns(self, document: Any) -> List[str]:
        """Extract per-page markdown."""

Chunking

DocumentChunker

Handles document chunking with real tokenizers and schema-aware sizing.

class DocumentChunker:
    """Chunk documents for processing."""

    def __init__(
        self,
        provider: str,
        max_tokens: int | None = None,
        tokenizer_name: str | None = None,
        schema_json: str | None = None
    ):
        """
        Initialize chunker.

        Args:
            provider: LLM provider (for tokenizer selection)
            max_tokens: Maximum tokens per chunk
            tokenizer_name: Specific tokenizer to use
            schema_json: Schema JSON string for dynamic adjustment
        """

    def chunk_markdown(
        self,
        markdown: str,
        max_tokens: int
    ) -> List[str]:
        """
        Chunk markdown by tokens using real tokenizer.

        Args:
            markdown: Markdown content
            max_tokens: Maximum tokens per chunk

        Returns:
            List of markdown chunks
        """

    def update_schema_config(self, schema_json: str):
        """
        Update schema configuration dynamically.

        Args:
            schema_json: New schema JSON string
        """

Features:

  • Real Tokenizers: Uses provider-specific tokenizers for accurate token counting
  • Safety Margins: Applies a 20% safety margin to absorb prompt and protocol overhead
  • Schema-Aware: Dynamically adjusts chunk size based on exact prompt tokens
  • Provider-Specific: Optimized for each LLM provider

Example:

import json

from docling_graph.core.extractors import DocumentChunker

# Create chunker with real tokenizer
chunker = DocumentChunker(
    provider="mistral",
    max_tokens=4096,
    schema_json=json.dumps(MyTemplate.model_json_schema())
)

# Chunk with accurate token counting
chunks = chunker.chunk_markdown(markdown, max_tokens=4096)

# Update for different schema
chunker.update_schema_config(schema_json=json.dumps(OtherTemplate.model_json_schema()))

Factory

ExtractorFactory.create_extractor()

Creates an extractor from pipeline configuration. Used internally by the pipeline; for programmatic use, import from docling_graph.core.extractors.

from docling_graph.core.extractors import ExtractorFactory

extractor = ExtractorFactory.create_extractor(
    processing_mode="many-to-one",
    backend_name="llm",
    extraction_contract="direct",  # or "staged" / "delta" (LLM + many-to-one only)
    staged_config=None,            # optional: pass_retries, parallel_workers, nodes_fill_cap, id_shard_size, delta_* for delta
    llm_client=client,
    docling_config="ocr",
)

Parameters:

| Parameter           | Type                            | Description                                                   |
| ------------------- | ------------------------------- | ------------------------------------------------------------- |
| processing_mode     | "one-to-one" \| "many-to-one"   | Extraction strategy                                           |
| backend_name        | "llm" \| "vlm"                  | Backend type                                                  |
| extraction_contract | "direct" \| "staged" \| "delta" | LLM contract; staged and delta only apply to many-to-one      |
| staged_config       | dict \| None                    | Optional staged tuning (pass_retries, parallel_workers, etc.) |
| model_name          | str \| None                     | Required for VLM                                              |
| llm_client          | LLMClientProtocol \| None       | Required for LLM                                              |
| docling_config      | str                             | "ocr" or "vision"                                             |

Returns: BaseExtractor instance.
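
For comparison, a sketch of the VLM path through the same factory. Parameter names follow the table above; the model name is just an example, and extraction_contract is omitted on the assumption that it defaults to "direct":

extractor = ExtractorFactory.create_extractor(
    processing_mode="one-to-one",
    backend_name="vlm",
    model_name="numind/NuExtract-2.0-8B",  # required for VLM backends
    docling_config="vision",
)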


Features

Zero Data Loss

Returns partial models instead of empty results:

results = extractor.extract("document.pdf", MyTemplate)

if len(results) == 1:
    # Success: merged model
    model = results[0]
else:
    # Partial: multiple models (data preserved!)
    for model in results:
        process_partial(model)
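
If you need a single object downstream, a field-wise merge of the partials is often enough. A minimal sketch follows; the first-non-empty-wins policy is illustrative, not the library's consolidation logic:

from pydantic import BaseModel

def merge_partials(partials: list[BaseModel], template: type[BaseModel]) -> BaseModel:
    """Illustrative merge: keep the first non-empty value seen per field."""
    merged: dict = {}
    for partial in partials:
        for field, value in partial.model_dump().items():
            if merged.get(field) in (None, "", [], {}):
                merged[field] = value
    return template(**merged)

merged_model = merge_partials(results, MyTemplate)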

Real Tokenizer Integration

Accurate token counting with safety margins:

chunker = DocumentChunker(
    provider="mistral",
    max_tokens=4096  # Uses real Mistral tokenizer
)
# Applies 20% safety margin automatically
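
The margin arithmetic itself is simple. A sketch using tiktoken as a stand-in tokenizer (docling-graph's actual provider-specific tokenizer choice may differ):

import tiktoken

markdown = "# Invoice\n\nCustomer: ACME Corp. Total: 1,240.00 EUR.\n" * 500

enc = tiktoken.get_encoding("cl100k_base")   # stand-in; not the Mistral tokenizer
n_tokens = len(enc.encode(markdown))

budget = int(4096 * 0.8)                     # 20% safety margin off max_tokens
print(f"{n_tokens} tokens vs budget {budget}: chunking needed = {n_tokens > budget}")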