
Delta Extraction

Overview

Delta extraction is an LLM extraction contract for many-to-one processing that turns document chunks into a flat graph IR (nodes and relationships), then normalizes, merges, and projects the result into your Pydantic template. It is designed for long documents and chunk-based workflows.

Set extraction_contract="delta" in your config or use --extraction-contract delta on the CLI. Chunking must be enabled (use_chunking=True, which is the default for many-to-one).
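
A minimal setup looks like this (the source file and template path are placeholders; the full option set is shown under Usage below):

from docling_graph import PipelineConfig, run_pipeline

config = PipelineConfig(
    source="document.pdf",                  # placeholder source
    template="templates.BillingDocument",   # placeholder template path
    backend="llm",
    processing_mode="many-to-one",
    extraction_contract="delta",
    use_chunking=True,                      # required for delta
)
context = run_pipeline(config)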

When to use:

  • Long documents where you want token-bounded batching (multiple chunks per LLM call, then merge by identity).
  • You prefer a graph-first representation: entities as nodes with path, ids, and parent, then projected to the template.
  • You want optional post-merge resolvers (fuzzy/semantic) to merge near-duplicate entities.

When to use direct (default) or staged:

  • Direct: Flat or simple templates; single-pass extraction and programmatic merge.
  • Staged: Complex nested templates; ID pass → fill pass → merge (no chunk batching).

How It Works

Delta extraction runs these steps:

%%{init: {'theme': 'redux-dark', 'look': 'default', 'layout': 'elk'}}%%
flowchart TB
    %% 1. Define Classes
    classDef input fill:#E3F2FD,stroke:#90CAF9,color:#0D47A1
    classDef config fill:#FFF8E1,stroke:#FFECB3,color:#5D4037
    classDef output fill:#E8F5E9,stroke:#A5D6A7,color:#1B5E20
    classDef decision fill:#FFE0B2,stroke:#FFB74D,color:#E65100
    classDef data fill:#EDE7F6,stroke:#B39DDB,color:#4527A0
    classDef operator fill:#F3E5F5,stroke:#CE93D8,color:#6A1B9A
    classDef process fill:#ECEFF1,stroke:#B0BEC5,color:#263238

    classDef subgraph_style fill:none,stroke:#969696,stroke-width:2px,stroke-dasharray: 5 5,color:#969696

    %% 2. Define Nodes
    n1@{ shape: terminal, label: "Source Chunks" }
    n2@{ shape: terminal, label: "Delta Template Config" }

    n3@{ shape: procs, label: "Batch Planning" }
    n3a@{ shape: lin-proc, label: "Greedy Token Packing" }

    n4@{ shape: tag-proc, label: "Per-batch LLM" }
    n5@{ shape: db, label: "Raw DeltaGraph" }

    n6@{ shape: lin-proc, label: "IR Normalization" }
    n7@{ shape: procs, label: "Graph Merge & Deduplication" }

    n8@{ shape: tag-proc, label: "Resolvers (Optional)" }
    n9@{ shape: tag-proc, label: "Identity Filter (Optional)" }

    n10@{ shape: procs, label: "Projection" }
    n11@{ shape: lin-proc, label: "Quality Gate Check" }

    n12@{ shape: tag-proc, label: "Direct Extraction Fallback" }
    n13@{ shape: terminal, label: "Final Result" }

    %% 3. Define Connections
    n1 & n2 --> n3
    n3 --> n3a
    n3a --> n4
    n4 --> n5
    n5 --> n6
    n6 --> n7

    %% Sequence of Logic
    n7 --> n8
    n8 --> n9
    n9 --> n10
    n10 --> n11

    %% Branching Logic
    n11 -- "Pass" --> n13
    n11 -- "Fail" --> n12
    n12 --> n13

    %% 4. Apply Classes
    class n1 input
    class n2 config
    class n5 data
    class n3,n7,n10,n11 process
    class n3a,n6 process
    class n4,n8,n9,n12 operator
    class n13 output

  1. Chunking — Performed outside delta (by the document processor or chunking strategy). Produces chunks and optional chunk metadata (e.g. token counts, page numbers).

  2. Batch planning — Chunks are packed into token-bounded batches (llm_batch_token_size); each batch is sent in one LLM call (see the packing sketch after this list).

  3. Per-batch LLM — For each batch, the LLM receives the batch document plus a path catalog and semantic guide from your template. It returns a DeltaGraph: nodes (path, ids, parent, properties) and optional relationships. Output is validated with retries on failure.

  4. IR normalization — Batch results are normalized: paths canonicalized to catalog paths, IDs normalized and optionally inferred from path indices, parent references repaired, nested properties stripped, provenance attached. Unknown paths can be dropped if delta_normalizer_validate_paths is true.

  5. Graph merge — Normalized graphs are merged with deduplication by (path, identity). Node properties are merged (e.g. the longer string wins on conflict). Relationships are deduplicated by edge and endpoints (see the merge sketch after this list).

  6. Resolvers (optional) — If delta_resolvers_enabled is true, a post-merge pass can merge near-duplicate nodes by fuzzy or semantic similarity (delta_resolvers_mode: fuzzy, semantic, or chain).

  7. Identity filter (optional) — If delta_identity_filter_enabled is true, entity nodes whose identity looks like a section/chapter title are dropped. With delta_identity_filter_strict true, only identities in the schema allowlist are kept.

  8. Projection — The merged graph is projected into a template-shaped root dict: nodes are attached to parents via (path, ids). When a parent is missing (e.g. dropped by the identity filter), a best-effort attachment to the first available parent of the same path is attempted so more nodes stay in the tree.

  9. Quality gate — The gate uses attached node count (nodes that made it into the root tree), not raw graph size. If attached_node_count is below delta_quality_min_instances (default 20) or parent lookup misses exceed the allowed tolerance, the gate fails. On fail, delta returns None and the many-to-one strategy falls back to direct extraction (full-document, single LLM call), which usually yields a richer graph for sparse delta runs.
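
A minimal sketch of the greedy token packing in step 2 (the chunk shape and token_count field are assumptions; the real planner works on the pipeline's own chunk objects):

def plan_batches(chunks, llm_batch_token_size=1024):
    """Greedy packing: start a new batch when adding a chunk would exceed the budget."""
    batches, current, current_tokens = [], [], 0
    for chunk in chunks:
        tokens = chunk["token_count"]  # assumed per-chunk metadata
        if current and current_tokens + tokens > llm_batch_token_size:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(chunk)
        current_tokens += tokens
    if current:
        batches.append(current)
    return batches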

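Similarly, a simplified sketch of the merge in step 5, keyed by (path, identity) with a longer-string-wins property merge (the node dict shape is illustrative, not the exact internal IR):

def merge_graphs(graphs):
    """Deduplicate nodes across batches by (path, identity) and merge their properties."""
    merged = {}
    for graph in graphs:
        for node in graph["nodes"]:
            key = (node["path"], tuple(sorted(node["ids"].items())))
            if key not in merged:
                merged[key] = {"path": node["path"], "ids": node["ids"],
                               "properties": dict(node["properties"])}
                continue
            props = merged[key]["properties"]
            for name, value in node["properties"].items():
                old = props.get(name)
                if isinstance(old, str) and isinstance(value, str):
                    props[name] = value if len(value) > len(old) else old  # prefer the longer string
                elif old in (None, "", []):
                    props[name] = value
    return list(merged.values())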

Schema Requirements

Delta uses a catalog derived from your Pydantic template (same idea as staged):

  • Paths — Root "", then nested paths like line_items[], line_items[].item. The LLM must use only these catalog paths.
  • Identity — Entities with graph_id_fields get stable keys for dedup and parent linkage; list items often use a field like line_number or index.
  • Flat properties — Node and relationship properties must be flat (scalars or lists of scalars). Nested objects are stripped by the normalizer.
  • Root required fields — Required root-level fields (e.g. reference_document, title) should be documented in the template so the LLM can fill them; the catalog's root-path hint reminds the LLM to include these fields when they are present in the document.

For identity and linkage best practices, see Schema design for staged extraction (same concepts apply to delta).
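
As an illustration, a template shaped like the following (a hypothetical, trimmed-down BillingDocument) yields the catalog paths "", line_items[], and line_items[].item; how identity (graph_id_fields) is declared is covered in the staged schema-design guide:

from pydantic import BaseModel, Field

class Item(BaseModel):
    # Flat properties only: scalars or lists of scalars.
    name: str | None = None
    unit_price: float | None = None

class LineItem(BaseModel):
    # line_number is the natural identity field for dedup and parent linkage.
    line_number: int | None = None
    item: Item | None = None

class BillingDocument(BaseModel):
    # Root-level fields the LLM should fill when present in the document.
    reference_document: str | None = Field(None, description="Reference of the billing document")
    title: str | None = None
    line_items: list[LineItem] = Field(default_factory=list)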


Configuration and options

All options can be set in Python via PipelineConfig or a config dict passed to run_pipeline(). CLI flags (when available) override config-file defaults.

Batching and parallelism

| Python (PipelineConfig / config dict) | CLI flag | Default | Description |
|---|---|---|---|
| extraction_contract | --extraction-contract | "direct" | Set to "delta" to enable delta extraction. |
| use_chunking | --use-chunking / --no-use-chunking | True | Must be enabled for delta (chunk → batch flow). |
| llm_batch_token_size | --llm-batch-token-size | 1024 | Max input tokens per LLM batch; a new call is started when a batch would exceed this. |
| parallel_workers | --parallel-workers | 1 (or preset) | Number of parallel workers for delta batch LLM calls. |
| staged_pass_retries | --staged-retries | 1 | Retries per batch when the LLM returns invalid JSON (used as max_pass_retries for delta). |

Quality gate

The gate uses attached node count (nodes successfully attached into the root tree during projection). If the gate fails, delta returns None and the strategy falls back to direct extraction.

| Python (config dict) | Default | Description |
|---|---|---|
| delta_quality_require_root | True | Require at least one root instance (path=""). |
| delta_quality_min_instances | 20 | Minimum attached nodes; below this, the gate fails and direct extraction is used. |
| delta_quality_max_parent_lookup_miss | 4 | Max allowed parent lookup misses before failing. Use -1 to disable this check (e.g. for deep or id-sparse schemas). |
| delta_quality_adaptive_parent_lookup | True | When a root exists, allow a higher effective miss tolerance (e.g. up to half of the instances, capped at 300). |
| delta_quality_require_relationships | False | Require at least one relationship in the graph. |
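
Conceptually, the gate check combines these options roughly as follows (an illustrative sketch of the documented behavior, not the exact implementation):

def gate_ok(attached_nodes, parent_misses, has_root,
            min_instances=20, max_parent_miss=4,
            require_root=True, adaptive=True):
    """Pass only if a root exists (when required), enough nodes attached, and parent misses are within tolerance."""
    if require_root and not has_root:
        return False
    if attached_nodes < min_instances:
        return False
    if max_parent_miss >= 0:  # -1 disables the parent-miss check
        tolerance = max_parent_miss
        if adaptive and has_root:
            # Adaptive mode raises the effective tolerance, e.g. half the instances capped at 300.
            tolerance = max(tolerance, min(attached_nodes // 2, 300))
        if parent_misses > tolerance:
            return False
    return True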

Identity filter

| Python (config dict) | Default | Description |
|---|---|---|
| delta_identity_filter_enabled | True | Drop entity nodes whose identity looks like a section/chapter title. |
| delta_identity_filter_strict | False | If true, drop any entity whose identity is not in the schema allowlist (for paths with identity_example_values). If false, only the section-title heuristic is applied. |

Other gate options (e.g. delta_quality_require_structural_attachments, quality_max_unknown_path_drops, quality_max_id_mismatch, quality_max_nested_property_drops) are documented in the config reference. Quality gate and identity filter options are not CLI flags; set them in a config file or config dict.
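
For example, to override them programmatically, pass a config dict to run_pipeline() (values here are illustrative):

from docling_graph import run_pipeline

context = run_pipeline({
    "source": "document.pdf",
    "template": "templates.BillingDocument",
    "extraction_contract": "delta",
    "delta_quality_min_instances": 10,
    "delta_quality_max_parent_lookup_miss": -1,  # disable the parent-miss check
    "delta_identity_filter_strict": True,
})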

IR normalizer

| Python (config dict) | CLI flag | Default | Description |
|---|---|---|---|
| delta_normalizer_validate_paths | --delta-normalizer-validate-paths / --no-delta-normalizer-validate-paths | True | Drop or repair nodes with unknown catalog paths. |
| delta_normalizer_canonicalize_ids | --delta-normalizer-canonicalize-ids / --no-delta-normalizer-canonicalize-ids | True | Canonicalize ID values before merge. |
| delta_normalizer_strip_nested_properties | --delta-normalizer-strip-nested-properties / --no-delta-normalizer-strip-nested-properties | True | Drop nested dict / list-of-dict properties from nodes and relationships. |
| delta_normalizer_attach_provenance | (config only) | True | Attach batch/chunk provenance to normalized nodes and relationships. |

Resolvers (post-merge dedup)

Optional pass to merge near-duplicate entities after the graph merge.

| Python (config dict) | CLI flag | Default | Description |
|---|---|---|---|
| delta_resolvers_enabled | --delta-resolvers-enabled / --no-delta-resolvers-enabled | True | Enable the resolver pass. |
| delta_resolvers_mode | --delta-resolvers-mode | "semantic" | One of off, fuzzy, semantic, chain. |
| delta_resolver_fuzzy_threshold | --delta-resolver-fuzzy-threshold | 0.9 | Similarity threshold for fuzzy matching. |
| delta_resolver_semantic_threshold | --delta-resolver-semantic-threshold | 0.92 | Similarity threshold for semantic matching. |
| delta_resolver_properties | (config only) | None | List of property names used for matching; default uses catalog fallback fields. |
| delta_resolver_paths | (config only) | None | Restrict the resolver to these catalog paths; empty means all. |

Gleaning (direct and delta)

Optional second-pass extraction ("what did you miss?") to improve recall. Applies to direct and delta contracts only (not staged). Enabled by default.

| Python (PipelineConfig / config dict) | CLI flag | Default | Description |
|---|---|---|---|
| gleaning_enabled | --gleaning-enabled / --no-gleaning-enabled | True | Run one extra extraction pass and merge additional entities/relations. |
| gleaning_max_passes | --gleaning-max-passes | 1 | Max number of gleaning passes when gleaning is enabled. |

Usage

Python API

Pass options via PipelineConfig or a dict to run_pipeline():

from docling_graph import PipelineConfig, run_pipeline

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",
    processing_mode="many-to-one",
    extraction_contract="delta",
    use_chunking=True,
    # Batching and parallelism
    llm_batch_token_size=2048,
    parallel_workers=2,
    staged_pass_retries=1,
    # Quality gate (optional overrides)
    delta_quality_require_root=True,
    delta_quality_min_instances=1,
    delta_quality_max_parent_lookup_miss=4,
    delta_quality_adaptive_parent_lookup=True,
    # IR normalizer
    delta_normalizer_validate_paths=True,
    delta_normalizer_canonicalize_ids=True,
    delta_normalizer_strip_nested_properties=True,
    delta_normalizer_attach_provenance=True,
    # Resolvers (optional)
    delta_resolvers_enabled=True,
    delta_resolvers_mode="semantic",
    delta_resolver_fuzzy_threshold=0.9,
    delta_resolver_semantic_threshold=0.92,
    # Gleaning (optional second-pass recall; also applies to direct)
    gleaning_enabled=True,
    gleaning_max_passes=1,
)
context = run_pipeline(config)

The options delta_quality_require_relationships and delta_quality_require_structural_attachments are not fields on PipelineConfig; set them in a config file (e.g. defaults in your YAML) or in a config dict: run_pipeline({..., "delta_quality_require_relationships": False}).

CLI

All delta-related flags (when using --extraction-contract delta):

# Required for delta
uv run docling-graph convert document.pdf \
  --template "templates.BillingDocument" \
  --extraction-contract delta

# Batching and parallelism
uv run docling-graph convert document.pdf \
  --template "templates.BillingDocument" \
  --extraction-contract delta \
  --use-chunking \
  --llm-batch-token-size 2048 \
  --parallel-workers 2 \
  --staged-retries 1

# IR normalizer (toggles)
uv run docling-graph convert document.pdf \
  --extraction-contract delta \
  --template "templates.BillingDocument" \
  --delta-normalizer-validate-paths \
  --delta-normalizer-canonicalize-ids \
  --no-delta-normalizer-strip-nested-properties

# Resolvers
uv run docling-graph convert document.pdf \
  --extraction-contract delta \
  --template "templates.BillingDocument" \
  --delta-resolvers-enabled \
  --delta-resolvers-mode fuzzy \
  --delta-resolver-fuzzy-threshold 0.9 \
  --delta-resolver-semantic-threshold 0.92

Quality gate and resolver list options (delta_resolver_properties, delta_resolver_paths, delta_quality_*, quality_max_*) are not CLI flags; use a config file (e.g. defaults in config_template.yaml or your project config) to set them.


Trace and debugging

When delta runs, the pipeline emits a trace (e.g. via trace_data or debug artifacts) containing:

  • contract: "delta"
  • chunk_count, batch_count, batch_timings, batch_errors
  • path_counts, normalizer_stats, merge_stats, resolver (if enabled)
  • quality_gate: { ok, reasons }
  • diagnostics: e.g. top missing-id paths, unknown path examples, parent lookup miss examples

With debug=True, artifacts like delta_trace.json, delta_merged_graph.json, and delta_merged_output.json can be written to the debug directory.
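
A quick way to inspect a run from the debug artifacts (the debug directory path below is a placeholder; adjust it to your run):

import json
from pathlib import Path

# Load the delta trace written by a debug run.
trace = json.loads(Path("debug/delta_trace.json").read_text())

print("batches:", trace["batch_count"], "chunks:", trace["chunk_count"])
print("quality gate:", trace["quality_gate"])  # e.g. {"ok": ..., "reasons": [...]}
for reason in trace["quality_gate"].get("reasons", []):
    print("gate reason:", reason)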