Configuration API

Overview

Type-safe configuration classes for the docling-graph pipeline.

Module: docling_graph.config


PipelineConfig

Main configuration class for pipeline execution.

class PipelineConfig(BaseModel):
    """Type-safe configuration for the docling-graph pipeline."""

Constructor

PipelineConfig(
    source: Union[str, Path] = "",
    template: Union[str, Type[BaseModel]] = "",
    backend: Literal["llm", "vlm"] = "llm",
    inference: Literal["local", "remote"] = "local",
    processing_mode: Literal["one-to-one", "many-to-one"] = "many-to-one",
    extraction_contract: Literal["direct", "staged", "delta"] = "direct",
    docling_config: Literal["ocr", "vision"] = "ocr",
    model_override: str | None = None,
    provider_override: str | None = None,
    models: ModelsConfig = ModelsConfig(),
    use_chunking: bool = True,
    chunk_max_tokens: int | None = None,
    debug: bool = False,
    max_batch_size: int = 1,
    dump_to_disk: bool | None = None,
    export_format: Literal["csv", "cypher"] = "csv",
    export_docling: bool = True,
    export_docling_json: bool = True,
    export_markdown: bool = True,
    export_per_page_markdown: bool = False,
    reverse_edges: bool = False,
    output_dir: Union[str, Path] = "outputs",
    # Staged-tuning, delta, and gleaning fields are also accepted;
    # see the field tables below for names and defaults.
)

Fields

Required Fields

| Field | Type | Description |
|-------|------|-------------|
| source | str or Path | Path to source document |
| template | str or Type[BaseModel] | Pydantic template class or dotted path |

Backend Configuration

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| backend | "llm" or "vlm" | "llm" | Extraction backend type |
| inference | "local" or "remote" | "local" | Inference location |
| model_override | str or None | None | Override default model |
| provider_override | str or None | None | Override default provider |
| models | ModelsConfig | ModelsConfig() | Model configurations |

Processing Configuration

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| processing_mode | "one-to-one" or "many-to-one" | "many-to-one" | Processing strategy |
| extraction_contract | "direct", "staged", or "delta" | "direct" | LLM extraction contract (staged is optimized for weaker models in many-to-one mode) |
| docling_config | "ocr" or "vision" | "ocr" | Docling pipeline type |
| use_chunking | bool | True | Enable document chunking |
| chunk_max_tokens | int or None | None | Max tokens per chunk (default 512 when chunking) |
| debug | bool | False | Enable debug artifacts |
| max_batch_size | int | 1 | Maximum batch size |
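
For example, the chunk size can be raised above the 512-token default (a minimal sketch using only the fields above):

from docling_graph import PipelineConfig

config = PipelineConfig(
    source="document.pdf",
    template="templates.MyTemplate",
    use_chunking=True,
    chunk_max_tokens=1024,  # None would fall back to the 512-token default
)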

Staged Tuning Configuration

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| staged_tuning_preset | "standard" or "advanced" | "standard" | Preset for staged extraction defaults |
| staged_pass_retries | int or None | None | Retries per staged pass (None uses preset) |
| parallel_workers | int or None | None | Parallel workers for extraction (staged fill pass and delta batch calls; None uses preset for staged) |
| staged_nodes_fill_cap | int or None | None | Max node instances per fill-pass call (None uses preset) |
| staged_id_shard_size | int or None | None | Max paths per ID-pass call (0 = no sharding; None uses preset) |
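
A minimal sketch of staged extraction with explicit tuning (values are illustrative; leaving a field as None falls back to the preset):

from docling_graph import PipelineConfig

config = PipelineConfig(
    source="document.pdf",
    template="templates.MyTemplate",
    extraction_contract="staged",
    staged_tuning_preset="advanced",
    staged_pass_retries=2,    # retries per staged pass
    parallel_workers=4,       # workers for the fill pass
    staged_id_shard_size=0,   # 0 disables ID-pass sharding
)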

Delta Configuration

Delta extraction uses the same parallel_workers setting (see Staged Tuning above) for parallel batch calls.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| llm_batch_token_size | int | 1024 | Max input tokens per delta batch (new LLM call when exceeded) |
| delta_quality_require_root | bool | True | Require root instance in merged output |
| delta_quality_min_instances | int | 1 | Minimum node count threshold |
| delta_quality_max_parent_lookup_miss | int | 4 | Max parent-link misses before quality failure; -1 disables this check |
| delta_quality_adaptive_parent_lookup | bool | True | Adaptive miss tolerance when root exists (e.g. up to half of instances, cap 300) |
| delta_quality_require_relationships | bool | False | Require at least one attached relationship/list linkage |
| delta_quality_require_structural_attachments | bool | False | Require structural attachments (avoid root-only outputs) |
| delta_normalizer_validate_paths | bool | True | Drop unknown/non-catalog paths |
| delta_normalizer_canonicalize_ids | bool | True | Normalize ID values for deterministic matching |
| delta_normalizer_strip_nested_properties | bool | True | Drop nested dict properties to keep flat graph payloads |
| delta_normalizer_attach_provenance | bool | True | Attach batch/chunk provenance to nodes/relationships |
| delta_resolvers_enabled | bool | True | Enable post-merge resolver pass |
| delta_resolvers_mode | str | "semantic" | Resolver mode: off, fuzzy, semantic, chain |
| delta_resolver_fuzzy_threshold | float | 0.9 | Fuzzy merge threshold |
| delta_resolver_semantic_threshold | float | 0.92 | Semantic merge threshold |
| delta_resolver_properties | list[str] | [] | Preferred properties used for resolver matching |
| delta_resolver_paths | list[str] | [] | Restrict resolver to selected catalog paths |
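
A minimal sketch of a delta configuration with a fuzzy resolver pass (values are illustrative; field names come from the table above):

from docling_graph import PipelineConfig

config = PipelineConfig(
    source="document.pdf",
    template="templates.MyTemplate",
    extraction_contract="delta",
    llm_batch_token_size=2048,            # larger input batches per LLM call
    delta_resolvers_mode="fuzzy",         # off, fuzzy, semantic, or chain
    delta_resolver_fuzzy_threshold=0.85,  # lower values merge more aggressively
)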

Gleaning (direct and delta)

Optional second-pass extraction to improve recall. Applies to direct and delta contracts only (not staged).

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| gleaning_enabled | bool | True | Run one extra extraction pass ("what did you miss?") and merge additional entities/relations |
| gleaning_max_passes | int | 1 | Max number of gleaning passes when gleaning_enabled is True |
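
For example (a sketch using only the two fields above):

from docling_graph import PipelineConfig

config = PipelineConfig(
    source="document.pdf",
    template="templates.MyTemplate",
    extraction_contract="direct",  # gleaning applies to direct and delta only
    gleaning_enabled=True,
    gleaning_max_passes=2,  # allow a second "what did you miss?" pass
)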

Export Configuration

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| dump_to_disk | bool or None | None | Control file exports. None = auto-detect (CLI=True, API=False), True = always export, False = never export |
| export_format | "csv" or "cypher" | "csv" | Graph export format |
| export_docling | bool | True | Export Docling outputs |
| export_docling_json | bool | True | Export Docling JSON |
| export_markdown | bool | True | Export markdown |
| export_per_page_markdown | bool | False | Export per-page markdown |

Graph Configuration

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| reverse_edges | bool | False | Create reverse edges |

Output Configuration

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| output_dir | str or Path | "outputs" | Output directory path |

Methods

run()

def run(self) -> None

Execute the pipeline with this configuration.

Example:

config = PipelineConfig(
    source="document.pdf",
    template="templates.MyTemplate"
)
config.run()

to_dict()

def to_dict(self) -> Dict[str, Any]

Convert configuration to dictionary format.

Returns: Dictionary with all configuration values

Example:

config = PipelineConfig(source="doc.pdf", template="templates.MyTemplate")
config_dict = config.to_dict()
print(config_dict["backend"])  # "llm"

generate_yaml_dict()

@classmethod
def generate_yaml_dict(cls) -> Dict[str, Any]

Generate YAML-compatible configuration dictionary with defaults.

Returns: Dictionary suitable for YAML serialization
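
Example (a sketch assuming PyYAML is available for serialization):

import yaml

yaml_dict = PipelineConfig.generate_yaml_dict()
print(yaml.safe_dump(yaml_dict, sort_keys=False))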


ModelsConfig

Configuration for all model types.

class ModelsConfig(BaseModel):
    """Complete models configuration."""

    llm: LLMConfig = Field(default_factory=LLMConfig)
    vlm: VLMConfig = Field(default_factory=VLMConfig)

Fields

| Field | Type | Description |
|-------|------|-------------|
| llm | LLMConfig | LLM model configuration |
| vlm | VLMConfig | VLM model configuration |

LLMConfig

Configuration for LLM models.

class LLMConfig(BaseModel):
    """LLM model configurations for local and remote inference."""

    local: ModelConfig = Field(default_factory=lambda: ModelConfig(
        model="ibm-granite/granite-4.0-1b",
        provider="vllm"
    ))
    remote: ModelConfig = Field(default_factory=lambda: ModelConfig(
        model="mistral-small-latest",
        provider="mistral"
    ))

Fields

| Field | Type | Default Model | Default Provider |
|-------|------|---------------|------------------|
| local | ModelConfig | ibm-granite/granite-4.0-1b | vllm |
| remote | ModelConfig | mistral-small-latest | mistral |
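
The defaults above can be overridden by passing a custom models configuration (a sketch; it assumes ModelsConfig is exported from docling_graph alongside LLMConfig and ModelConfig, and the remote model name is illustrative):

from docling_graph import PipelineConfig, ModelsConfig, LLMConfig, ModelConfig

config = PipelineConfig(
    source="document.pdf",
    template="templates.MyTemplate",
    inference="remote",
    models=ModelsConfig(  # ModelsConfig import path is an assumption
        llm=LLMConfig(
            remote=ModelConfig(model="mistral-large-latest", provider="mistral")
        )
    ),
)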

VLMConfig

Configuration for VLM models.

class VLMConfig(BaseModel):
    """VLM model configuration."""

    local: ModelConfig = Field(default_factory=lambda: ModelConfig(
        model="numind/NuExtract-2.0-8B",
        provider="docling"
    ))

Fields

| Field | Type | Default Model | Default Provider |
|-------|------|---------------|------------------|
| local | ModelConfig | numind/NuExtract-2.0-8B | docling |

VLM inference

VLM only supports local inference.


ModelConfig

Configuration for a specific model.

class ModelConfig(BaseModel):
    """Configuration for a specific model."""

    model: str = Field(..., description="Model name/path")
    provider: str = Field(..., description="Provider name")

Fields

| Field | Type | Description |
|-------|------|-------------|
| model | str | Model name or path |
| provider | str | Provider name (e.g., "vllm", "mistral") |

BackendConfig

Configuration for extraction backend (internal use).

class BackendConfig(BaseModel):
    """Configuration for an extraction backend."""

    provider: str = Field(..., description="Backend provider")
    model: str = Field(..., description="Model name or path")
    api_key: str | None = Field(None, description="API key")
    base_url: str | None = Field(None, description="Base URL")

ExtractorConfig

Configuration for extraction strategy (internal use).

class ExtractorConfig(BaseModel):
    """Configuration for the extraction strategy."""

    strategy: Literal["many-to-one", "one-to-one"] = Field(default="many-to-one")
    docling_config: Literal["ocr", "vision"] = Field(default="ocr")
    use_chunking: bool = Field(default=True)
    chunker_config: Dict[str, Any] | None = Field(default=None)

Usage Examples

Basic Configuration

from docling_graph import run_pipeline, PipelineConfig

config = PipelineConfig(
    source="document.pdf",
    template="templates.MyTemplate"
)
run_pipeline(config)

Custom Backend

config = PipelineConfig(
    source="document.pdf",
    template="templates.MyTemplate",
    backend="llm",
    inference="remote",
    model_override="gpt-4-turbo",
    provider_override="openai"
)
run_pipeline(config)

Custom Processing

config = PipelineConfig(
    source="document.pdf",
    template="templates.MyTemplate",
    processing_mode="one-to-one",
    use_chunking=False
)
run_pipeline(config)

Custom Export

from docling_graph import run_pipeline, PipelineConfig

config = PipelineConfig(
    source="document.pdf",
    template="templates.MyTemplate",
    dump_to_disk=True,  # Enable file exports
    export_format="cypher",
    export_docling_json=True,
    export_markdown=True,
    export_per_page_markdown=True,
    output_dir="custom_outputs"
)

# Returns data AND writes files
context = run_pipeline(config)

API Mode (No File Exports)

from docling_graph import run_pipeline, PipelineConfig

config = PipelineConfig(
    source="document.pdf",
    template="templates.MyTemplate"
    # dump_to_disk defaults to None (auto-detects as False for API)
    # debug defaults to False (no debug artifacts)
)

# Returns data only, no file exports, no debug artifacts
context = run_pipeline(config)
graph = context.knowledge_graph

Debug Mode

from docling_graph import run_pipeline, PipelineConfig
from pathlib import Path

# Enable debug mode for troubleshooting
config = PipelineConfig(
    source="document.pdf",
    template="templates.MyTemplate",
    debug=True,  # Enable debug mode
    dump_to_disk=True  # Also save final outputs
)

context = run_pipeline(config)

# Debug artifacts saved to outputs/{document}_{timestamp}/debug/
debug_dir = Path(context.output_dir) / "debug"
print(f"Debug artifacts saved to: {debug_dir}")

# Analyze staged debug artifacts
import json

# Core staged artifacts
for name in [
    "node_catalog.json",
    "id_pass.json",
    "fill_pass.json",
    "edges_pass.json",
    "merged_output.json",
    "staged_trace.json",
]:
    p = debug_dir / name
    print(name, "exists" if p.exists() else "missing")

# Example: inspect staged trace timings
with open(debug_dir / "staged_trace.json") as f:
    trace = json.load(f)
    print(trace.get("timings_seconds", {}))

# When extraction_contract="delta", debug may include: delta_trace.json, delta_merged_graph.json, delta_merged_output.json

Explicit Control

from docling_graph import run_pipeline, PipelineConfig

# Force file exports in API mode
config = PipelineConfig(
    source="document.pdf",
    template="templates.MyTemplate",
    dump_to_disk=True,
    output_dir="outputs"
)
context = run_pipeline(config)

# Force no file exports (even if output_dir is set)
config = PipelineConfig(
    source="document.pdf",
    template="templates.MyTemplate",
    dump_to_disk=False,
    output_dir="outputs"  # Ignored
)
context = run_pipeline(config)

Complete Configuration

from docling_graph import run_pipeline, PipelineConfig

config = PipelineConfig(
    # Source
    source="document.pdf",
    template="templates.MyTemplate",

    # Backend
    backend="llm",
    inference="remote",
    model_override="mistral-small-latest",
    provider_override="mistral",

    # Processing
    processing_mode="many-to-one",
    docling_config="ocr",
    use_chunking=True,
    max_batch_size=5,

    # Export
    export_format="csv",
    export_docling=True,
    export_docling_json=True,
    export_markdown=True,
    export_per_page_markdown=False,

    # Graph
    reverse_edges=False,

    # Output
    output_dir="outputs/custom"
)

run_pipeline(config)

Validation

Automatic Validation

Pydantic validates all fields automatically:

# ✅ Valid
config = PipelineConfig(
    source="doc.pdf",
    template="templates.MyTemplate",
    backend="llm"
)

# ❌ Invalid - raises ValidationError
config = PipelineConfig(
    source="doc.pdf",
    template="templates.MyTemplate",
    backend="invalid"  # Not "llm" or "vlm"
)
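
Validation errors can be caught as standard Pydantic ValidationError instances (a minimal sketch):

from pydantic import ValidationError

try:
    PipelineConfig(
        source="doc.pdf",
        template="templates.MyTemplate",
        backend="invalid",  # not "llm" or "vlm"
    )
except ValidationError as e:
    print(e.errors()[0]["loc"])  # ("backend",)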

Custom Validation

VLM backend validation:

# ❌ Invalid - VLM doesn't support remote
config = PipelineConfig(
    source="doc.pdf",
    template="templates.MyTemplate",
    backend="vlm",
    inference="remote"  # Raises ValueError
)

# ✅ Valid
config = PipelineConfig(
    source="doc.pdf",
    template="templates.MyTemplate",
    backend="vlm",
    inference="local"
)

Type Safety

Type Hints

All fields have proper type hints:

from docling_graph import run_pipeline, PipelineConfig
from pathlib import Path

# Type checker knows these are valid
config = PipelineConfig(
    source="doc.pdf",  # str
    template="templates.MyTemplate",  # str
    backend="llm",  # Literal["llm", "vlm"]
    use_chunking=True  # bool
)

# Type checker knows output_dir is Union[str, Path]
output = config.output_dir

IDE Support

IDEs provide autocomplete and type checking:

config = PipelineConfig(
    source="doc.pdf",
    template="templates.MyTemplate",
    backend="l"  # IDE suggests "llm"
)

Default Values

All fields have sensible defaults:

config = PipelineConfig(
    source="doc.pdf",
    template="templates.MyTemplate"
    # All other fields use defaults
)

print(config.backend)  # "llm"
print(config.inference)  # "local"
print(config.processing_mode)  # "many-to-one"
print(config.use_chunking)  # True
print(config.export_format)  # "csv"


See Also