Extractors API¶
Overview¶
Document extraction strategies and backends.
Module: docling_graph.core.extractors
Recent Improvements
- Model Capability Detection: Automatic tier detection and adaptive prompting
- Chain of Density: Multi-turn consolidation for ADVANCED tier models
- Zero Data Loss: Returns partial models instead of empty results on failures
- Real Tokenizers: Accurate token counting with 20% safety margins
- Enhanced GPU Cleanup: Better memory management for VLM backends
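The examples on this page pass a Pydantic template called `MyTemplate`, which is not defined here. A minimal hypothetical stand-in (field names are illustrative only):

```python
from pydantic import BaseModel, Field

class MyTemplate(BaseModel):
    """Hypothetical extraction template used in the examples below."""
    title: str | None = Field(default=None, description="Document title")
    authors: list[str] = Field(default_factory=list, description="Author names")
    summary: str | None = Field(default=None, description="Short summary")
```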
Extraction Strategies¶
OneToOne¶
Per-page extraction strategy.
```python
class OneToOne(ExtractorProtocol):
    """Extract data from each page separately."""

    def __init__(self, backend: Backend):
        """Initialize with backend."""
        self.backend = backend

    def extract(
        self,
        source: str,
        template: Type[BaseModel]
    ) -> List[BaseModel]:
        """
        Extract from each page.

        Returns:
            List of models (one per page)
        """
```
Use Cases:

- Multi-page documents with independent content
- Page-level analysis
- Parallel processing
Example:
```python
from docling_graph.core.extractors import OneToOne
from docling_graph.core.extractors.backends import LLMBackend
from docling_graph.llm_clients import get_client
from docling_graph.llm_clients.config import resolve_effective_model_config

# Build a client first: LLMBackend takes a client, model, and provider
client = get_client("ollama")(model_config=resolve_effective_model_config("ollama", "llama3.1:8b"))
backend = LLMBackend(client=client, model="llama3.1:8b", provider="ollama")

extractor = OneToOne(backend=backend)
results = extractor.extract("document.pdf", MyTemplate)
print(f"Extracted {len(results)} pages")
```
ManyToOne¶
Consolidated extraction strategy with zero data loss.
```python
class ManyToOne(ExtractorProtocol):
    """Extract and consolidate data from entire document."""

    def __init__(
        self,
        backend: Backend,
        use_chunking: bool = True,
    ):
        """Initialize with backend and options."""
        self.backend = backend
        self.use_chunking = use_chunking

    def extract(
        self,
        source: str,
        template: Type[BaseModel]
    ) -> List[BaseModel]:
        """
        Extract and consolidate.

        Returns:
            List with single consolidated model (success)
            or multiple partial models (merge failure - zero data loss)
        """
```
Use Cases:

- Single entity across document
- Consolidated information
- Summary extraction
Features:

- Zero Data Loss: Returns partial models if consolidation fails
- Consolidation: Programmatic merge of chunk results
- Schema-Aware Chunking: Dynamically adjusts chunk size based on schema
Example:
```python
from docling_graph.core.extractors import ManyToOne
from docling_graph.core.extractors.backends import LLMBackend
from docling_graph.llm_clients import get_client
from docling_graph.llm_clients.config import resolve_effective_model_config

# Build a client first: LLMBackend takes a client, model, and provider
client = get_client("ollama")(model_config=resolve_effective_model_config("ollama", "llama3.1:8b"))
backend = LLMBackend(client=client, model="llama3.1:8b", provider="ollama")

extractor = ManyToOne(
    backend=backend,
    use_chunking=True,
)
results = extractor.extract("document.pdf", MyTemplate)

# Check if consolidation succeeded
if len(results) == 1:
    print(f"✅ Consolidated model: {results[0]}")
else:
    print(f"⚠ Got {len(results)} partial models (data preserved)")
```
Backends¶
LLMBackend¶
LLM-based extraction backend with adaptive prompting.
```python
class LLMBackend(TextExtractionBackendProtocol):
    """LLM backend for text extraction."""

    def __init__(
        self,
        client: LLMClientProtocol,
        model: str,
        provider: str
    ):
        """Initialize LLM backend."""
        self.client = client
        self.model = model
        self.provider = provider
        self.model_capability = self._detect_capability()  # Auto-detect tier
```
Methods:
- `extract_from_markdown(markdown, template, context, is_partial)` - Extract from markdown with adaptive prompting
- `consolidate_from_pydantic_models(raw_models, programmatic_model, template)` - Consolidate models (uses Chain of Density for ADVANCED tier)
- `cleanup()` - Clean up resources
Model Capability Tiers:
| Tier | Model Size | Prompt Style | Consolidation |
|---|---|---|---|
| SIMPLE | 1B-7B | Minimal | Single-turn |
| STANDARD | 7B-13B | Balanced | Single-turn |
| ADVANCED | 13B+ | Detailed | Chain of Density (3 turns) |
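The detection logic itself is not documented on this page; a plausible sketch consistent with the table above, parsing the parameter count from the model name (the names `ModelCapability` and `detect_capability` are hypothetical):

```python
import re
from enum import Enum

class ModelCapability(Enum):  # hypothetical enum, mirrors the tier table
    SIMPLE = "simple"
    STANDARD = "standard"
    ADVANCED = "advanced"

def detect_capability(model_name: str) -> ModelCapability:
    """Map a parameter count parsed from the model name to a tier."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*[bB]", model_name)
    if not match:
        return ModelCapability.STANDARD  # conservative default
    size_b = float(match.group(1))
    if size_b < 7:
        return ModelCapability.SIMPLE    # 1B-7B
    if size_b <= 13:
        return ModelCapability.STANDARD  # 7B-13B
    return ModelCapability.ADVANCED      # 13B+

print(detect_capability("llama3.1:8b"))  # ModelCapability.STANDARD
```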
Example:
```python
from docling_graph.core.extractors.backends import LLMBackend
from docling_graph.llm_clients import get_client
from docling_graph.llm_clients.config import resolve_effective_model_config

# STANDARD tier model (7B-13B)
effective = resolve_effective_model_config("ollama", "llama3.1:8b")
client = get_client("ollama")(model_config=effective)
backend = LLMBackend(client=client, model="llama3.1:8b", provider="ollama")

# Automatically uses STANDARD tier prompts
model = backend.extract_from_markdown(
    markdown=markdown,
    template=MyTemplate,
    context="full document",
    is_partial=False
)
```
VLMBackend¶
Vision-Language Model backend with enhanced GPU cleanup.
```python
class VLMBackend(ExtractionBackendProtocol):
    """VLM backend for document extraction."""

    def __init__(self, model: str):
        """Initialize VLM backend."""
        self.model_name = model
        self.model = None  # Loaded on first use
```
Methods:
- `extract_from_document(source, template)` - Extract from document
- `cleanup()` - Enhanced GPU memory cleanup
Enhanced GPU Cleanup:
The `cleanup()` method now includes:
- Model-to-CPU transfer before deletion
- Explicit CUDA cache clearing
- Memory usage tracking and logging
- Multi-GPU device support
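The actual implementation is internal; an illustrative sketch of these steps, assuming a PyTorch-backed model and a backend object shaped like the class above:

```python
import gc
import torch

def release_vlm(backend) -> None:
    """Sketch of the cleanup steps listed above (not the library's source)."""
    if backend.model is None:
        return
    before = torch.cuda.memory_allocated() if torch.cuda.is_available() else 0
    backend.model.to("cpu")   # model-to-CPU transfer before deletion
    backend.model = None      # drop the reference
    gc.collect()
    if torch.cuda.is_available():
        for device in range(torch.cuda.device_count()):  # multi-GPU device support
            with torch.cuda.device(device):
                torch.cuda.empty_cache()  # explicit CUDA cache clearing
        freed = before - torch.cuda.memory_allocated()
        print(f"Released ~{freed / 1e6:.1f} MB of GPU memory")  # usage tracking
```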
Example:
```python
from docling_graph.core.extractors.backends import VLMBackend

backend = VLMBackend(model="numind/NuExtract-2.0-8B")
try:
    models = backend.extract_from_document("document.pdf", MyTemplate)
finally:
    backend.cleanup()  # Properly releases GPU memory
```
Document Processing¶
DocumentProcessor¶
Handles document conversion and markdown extraction.
```python
class DocumentProcessor(DocumentProcessorProtocol):
    """Process documents with Docling."""

    def convert_to_docling_doc(self, source: str) -> Any:
        """Convert to Docling document."""

    def extract_full_markdown(self, document: Any) -> str:
        """Extract full markdown."""

    def extract_page_markdowns(self, document: Any) -> List[str]:
        """Extract per-page markdown."""
```
Chunking¶
DocumentChunker¶
Handles document chunking with real tokenizers and schema-aware sizing.
```python
class DocumentChunker:
    """Chunk documents for processing."""

    def __init__(
        self,
        provider: str,
        max_tokens: int | None = None,
        tokenizer_name: str | None = None,
        schema_json: str | None = None
    ):
        """
        Initialize chunker.

        Args:
            provider: LLM provider (for tokenizer selection)
            max_tokens: Maximum tokens per chunk
            tokenizer_name: Specific tokenizer to use
            schema_json: Schema JSON string for dynamic adjustment
        """

    def chunk_markdown(
        self,
        markdown: str,
        max_tokens: int
    ) -> List[str]:
        """
        Chunk markdown by tokens using a real tokenizer.

        Args:
            markdown: Markdown content
            max_tokens: Maximum tokens per chunk

        Returns:
            List of markdown chunks
        """

    def update_schema_config(self, schema_json: str):
        """
        Update schema configuration dynamically.

        Args:
            schema_json: New schema JSON string
        """
```
Features:
- Real Tokenizers: Uses provider-specific tokenizers for accurate token counting
- Safety Margins: Reserves a fixed 100-token buffer for protocol overhead
- Schema-Aware: Dynamically adjusts chunk size based on exact prompt tokens
- Provider-Specific: Optimized for each LLM provider
Example:
```python
import json
from docling_graph.core.extractors import DocumentChunker

# Create chunker with real tokenizer
chunker = DocumentChunker(
    provider="mistral",
    max_tokens=4096,
    schema_json=json.dumps(MyTemplate.model_json_schema())
)

# Chunk with accurate token counting
chunks = chunker.chunk_markdown(markdown, max_tokens=4096)

# Update for different schema
chunker.update_schema_config(schema_json=json.dumps(OtherTemplate.model_json_schema()))
```
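The exact accounting is internal, but the schema-aware idea reduces to subtracting fixed costs from the context budget. A sketch consistent with the fixed 100-token buffer noted above (`effective_chunk_budget` is a hypothetical helper, and the token counts are made-up inputs):

```python
def effective_chunk_budget(max_tokens: int, prompt_tokens: int,
                           schema_tokens: int, buffer: int = 100) -> int:
    """Hypothetical helper: tokens left for document content per chunk."""
    return max(1, max_tokens - prompt_tokens - schema_tokens - buffer)

# A 4096-token context with a 350-token prompt and a 600-token schema
# leaves 4096 - 350 - 600 - 100 = 3046 tokens for markdown content.
print(effective_chunk_budget(4096, 350, 600))  # 3046
```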
Factory¶
ExtractorFactory.create_extractor()¶
Creates an extractor from pipeline configuration. Used internally by the pipeline; for programmatic use, import from docling_graph.core.extractors.
```python
from docling_graph.core.extractors import ExtractorFactory

extractor = ExtractorFactory.create_extractor(
    processing_mode="many-to-one",
    backend_name="llm",
    extraction_contract="direct",  # or "staged" / "delta" (LLM + many-to-one only)
    staged_config=None,  # optional: pass_retries, parallel_workers, nodes_fill_cap, id_shard_size, delta_* for delta
    llm_client=client,
    docling_config="ocr",
)
```
Parameters:
| Parameter | Type | Description |
|---|---|---|
| `processing_mode` | `"one-to-one"` \| `"many-to-one"` | Extraction strategy |
| `backend_name` | `"llm"` \| `"vlm"` | Backend type |
| `extraction_contract` | `"direct"` \| `"staged"` \| `"delta"` | LLM contract; staged and delta only apply to many-to-one |
| `staged_config` | `dict` \| `None` | Optional staged tuning (pass_retries, parallel_workers, etc.) |
| `model_name` | `str` \| `None` | Required for VLM |
| `llm_client` | `LLMClientProtocol` \| `None` | Required for LLM |
| `docling_config` | `str` | `"ocr"` or `"vision"` |
Returns: BaseExtractor instance.
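Assuming the same factory signature, a VLM configuration would pass `model_name` and omit `llm_client`, per the parameter table above (the model id is illustrative):

```python
from docling_graph.core.extractors import ExtractorFactory

extractor = ExtractorFactory.create_extractor(
    processing_mode="one-to-one",
    backend_name="vlm",
    extraction_contract="direct",  # staged/delta apply to LLM + many-to-one only
    model_name="numind/NuExtract-2.0-8B",  # required for VLM
    docling_config="vision",
)
```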
Features¶
Zero Data Loss¶
Returns partial models instead of empty results:
```python
results = extractor.extract("document.pdf", MyTemplate)

if len(results) == 1:
    # Success: merged model
    model = results[0]
else:
    # Partial: multiple models (data preserved!)
    for model in results:
        process_partial(model)
```
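If you need to fold partials into one object yourself, a simple strategy is to keep the first non-empty value per field and concatenate lists. This is not the library's merge algorithm, just an illustration (`merge_partials` is hypothetical):

```python
from pydantic import BaseModel

def merge_partials(partials: list[BaseModel], template: type[BaseModel]) -> BaseModel:
    """Hypothetical helper: fold partial models into one validated instance."""
    merged: dict = {}
    for model in partials:
        for field, value in model.model_dump(exclude_none=True).items():
            if field not in merged or merged[field] in ("", [], {}):
                merged[field] = value  # first non-empty value wins
            elif isinstance(merged[field], list) and isinstance(value, list):
                merged[field] = merged[field] + value  # concatenate list fields
    return template.model_validate(merged)
```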
Real Tokenizer Integration¶
Accurate token counting with safety margins:
```python
chunker = DocumentChunker(
    provider="mistral",
    max_tokens=4096  # Uses real Mistral tokenizer
)
# Applies 20% safety margin automatically
```
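To see what the margin arithmetic looks like with a real tokenizer: the library selects provider-specific tokenizers internally, so the Hugging Face tokenizer and model id below are assumptions for illustration only:

```python
from transformers import AutoTokenizer  # assumption: HF tokenizer as the "real tokenizer"

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")  # illustrative id
token_count = len(tokenizer.encode("Some markdown content..."))  # exact count, not an estimate

max_tokens = 4096
budget = int(max_tokens * 0.8)  # 20% safety margin leaves 3276 usable tokens
print(token_count, budget)
```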
Related APIs¶
- Staged Extraction - Multi-pass extraction
- Delta Extraction - Chunk-based graph extraction
- Extraction Process - Usage guide
- Model Merging - Zero data loss
- Protocols - Backend protocols
- Custom Backends - Create backends