Serialization
Introduction
A document serializer (AKA simply serializer) is a Docling abstraction that is
initialized with a given DoclingDocument and returns a textual representation
for that document.
Besides the document serializer, Docling defines similar abstractions for several document subcomponents, for example: text serializer, table serializer, picture serializer, list serializer, inline serializer, and more.
Last but not least, a serializer provider is a wrapper that abstracts the document serialization strategy from the document instance.
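As a rough illustration of the provider idea, the following self-contained sketch uses plain Python stand-ins rather than Docling's own classes. ToyDoc, ToySerializer, UpperCaseSerializer, ToySerializerProvider and render are all hypothetical names introduced for this example only. The point is that the caller picks a strategy once, through the provider, and the provider binds it to whichever document arrives later.

```python
from dataclasses import dataclass, field
from typing import List, Type


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a document; not Docling's DoclingDocument."""
    paragraphs: List[str] = field(default_factory=list)


class ToySerializer:
    """Hypothetical serializer: bound to one document at construction time."""

    def __init__(self, doc: ToyDoc) -> None:
        self.doc = doc

    def serialize(self) -> str:
        return "\n\n".join(self.doc.paragraphs)


class UpperCaseSerializer(ToySerializer):
    """A second strategy, to show that the provider can swap it in."""

    def serialize(self) -> str:
        return super().serialize().upper()


class ToySerializerProvider:
    """Hypothetical provider: holds the strategy, not the document."""

    def __init__(self, serializer_cls: Type[ToySerializer]) -> None:
        self.serializer_cls = serializer_cls

    def get_serializer(self, doc: ToyDoc) -> ToySerializer:
        # The document only enters the picture here.
        return self.serializer_cls(doc)


def render(doc: ToyDoc, provider: ToySerializerProvider) -> str:
    """Downstream code depends on the provider, never on a concrete serializer."""
    return provider.get_serializer(doc).serialize()


doc = ToyDoc(paragraphs=["Hello", "world"])
plain = render(doc, ToySerializerProvider(ToySerializer))
loud = render(doc, ToySerializerProvider(UpperCaseSerializer))
```

In this sketch, render never names a concrete serializer class, so switching the output strategy means passing a different provider and nothing else.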
Base classes
To enable both flexibility for downstream applications and out-of-the-box utility, Docling defines a serialization class hierarchy, providing:
- base types for the above abstractions: BaseDocSerializer, as well as BaseTextSerializer, BaseTableSerializer etc., and BaseSerializerProvider, and
- specific subclasses for the above-mentioned base types, e.g. MarkdownDocSerializer.
You can review all methods required to define the above base classes here.
From a client perspective, the most relevant method is BaseDocSerializer.serialize(),
which returns the textual representation, as well as relevant metadata on which
document components contributed to that serialization.
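To make the "text plus contribution metadata" idea concrete, here is a minimal sketch in plain Python. It does not use Docling's classes or reproduce its result type: ToyItem, ToySpan, ToyResult and ToyDocSerializer, along with the item-0 style references, are assumptions made up for illustration. The sketch records, for each component, the character range it produced in the final text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyItem:
    """Hypothetical document component with a reference and some text."""
    ref: str
    text: str


@dataclass
class ToySpan:
    """Hypothetical record: which component produced which slice of the output."""
    ref: str
    start: int
    end: int


@dataclass
class ToyResult:
    """Hypothetical result type: the text plus provenance metadata."""
    text: str
    spans: List[ToySpan]


class ToyDocSerializer:
    """Hypothetical serializer; not Docling's BaseDocSerializer."""

    SEPARATOR = "\n\n"

    def __init__(self, items: List[ToyItem]) -> None:
        self.items = items

    def serialize(self) -> ToyResult:
        parts: List[str] = []
        spans: List[ToySpan] = []
        offset = 0
        for i, item in enumerate(self.items):
            if i > 0:
                # Skip over the separator that join() will place here.
                offset += len(self.SEPARATOR)
            spans.append(ToySpan(item.ref, offset, offset + len(item.text)))
            parts.append(item.text)
            offset += len(item.text)
        return ToyResult(text=self.SEPARATOR.join(parts), spans=spans)


items = [ToyItem("item-0", "Title"), ToyItem("item-1", "Body text")]
result = ToyDocSerializer(items).serialize()

# Each span can be mapped back to the exact text its component contributed.
contributions = {s.ref: result.text[s.start:s.end] for s in result.spans}
```

A client that only needs the text reads result.text, while one that needs traceability (for example, to link output passages back to source components) can walk result.spans.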
Use in DoclingDocument export methods
Docling provides predefined serializers for Markdown, HTML, and DocTags.
The respective DoclingDocument export methods (e.g. export_to_markdown()) are
provided as user shorthands: internally, they directly instantiate the
respective serializers and delegate to them.
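The delegation pattern described above can be sketched as follows. This is a toy model in plain Python, not Docling's implementation: ToyDocument and ToyMarkdownSerializer are hypothetical classes, and only the method name export_to_markdown is borrowed from the text. It shows that the shorthand and the explicit serializer call yield the same output.

```python
from typing import List


class ToyMarkdownSerializer:
    """Hypothetical serializer that renders a toy document as Markdown."""

    def __init__(self, doc: "ToyDocument") -> None:
        self.doc = doc

    def serialize(self) -> str:
        blocks = [f"# {self.doc.title}"] + self.doc.paragraphs
        return "\n\n".join(blocks)


class ToyDocument:
    """Hypothetical document type; not Docling's DoclingDocument."""

    def __init__(self, title: str, paragraphs: List[str]) -> None:
        self.title = title
        self.paragraphs = paragraphs

    def export_to_markdown(self) -> str:
        # Shorthand: build the matching serializer and delegate to it.
        return ToyMarkdownSerializer(self).serialize()


doc = ToyDocument("Report", ["First paragraph."])
via_shorthand = doc.export_to_markdown()
via_serializer = ToyMarkdownSerializer(doc).serialize()
```

Under this design, the shorthand is convenient for the default case, while instantiating the serializer directly leaves room to configure or subclass it.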
Examples
For an example showcasing how to use serializers, see here.