Best Practices¶
Template checklist¶
- Define root identity with
graph_id_fields. - Keep entity IDs stable, short, and required.
- Use components for value objects (
is_entity=False). - Use explicit
edge(label=...)where relationship materialization matters. - Limit nesting depth (2-4 recommended).
- Use consistent naming and canonical examples.
Deterministic extraction guidance¶
- Prefer schema fields that appear explicitly in source documents.
- Add canonicalization hints for dates, units, and codes.
- Avoid identity fields that require invention by the model.
- Use lenient validators that normalize values instead of rejecting entire output.
Delta-specific quality guidance¶
- Keep node properties flat (primitives or list of primitives).
- Use path-consistent relationship structures.
- Design local entities with parent context available in schema.
- Avoid ambiguous fallback identities by exposing meaningful discriminator fields.
Identity and descriptive IDs¶
- Prefer descriptive, human-readable identifiers (e.g.
STUDY-TEMPERATURE-DEPENDENCE,Phenomenological fitting) over bare section labels. - Avoid using only a section letter or Roman numeral (e.g.
C,V,VI) as an identity field when the document uses them as headings; in the Field description, add: “Avoid using only a section letter or Roman numeral; prefer a descriptive label or combine with topic.” - Give 2–5 concrete examples that show the desired style (e.g.
FIG-4,STUDY-BINDER-MW); avoid examples that look like raw headings (e.g.3.1,Section-3) if you want the model to prefer descriptive IDs.
Optionality for often-missing fields¶
- Keep identity fields required when possible; the pipeline uses them for merge and linkage.
- For non-identity fields that the LLM often omits (e.g. objective, protocol details, axis labels), consider making them
Optionalwith defaultNoneso validation passes without salvage and the graph does not store synthetic placeholders. - Use optionality when the value is genuinely sparse in the source (e.g. “if not in Methods, leave empty”) rather than making every field optional.
Deduplicating root-level lists (chunked extraction)¶
In delta (and other chunked) extraction, root-level list fields without identity (e.g. authors) can receive the same item from multiple batches, so the merged list may contain duplicates.
- Add a model_validator (
mode="after") on the root model that deduplicates the list by a stable key (e.g.full_namefor authors: keep first occurrence per normalized name). - This keeps the pipeline domain-agnostic; dedup logic lives in the template and applies after validation.
Common failure causes¶
- Optional identity fields.
- Over-nested schemas with weak parent identifiers.
- Vague descriptions with no extraction hints.
- Inconsistent examples across equivalent fields.
- Edge labels omitted on relationship-bearing fields.
- Identity field descriptions that encourage raw section/figure labels (e.g. “section number”) without discouraging single-letter or Roman-numeral-only IDs.