Skip to content

Serialization

Introduction

A document serializer (AKA simply serializer) is a Docling abstraction that is initialized with a given DoclingDocument and returns a textual representation for that document.

Besides the document serializer, Docling defines similar abstractions for several document subcomponents, for example: text serializer, table serializer, picture serializer, list serializer, inline serializer, and more.

Last but not least, a serializer provider is a wrapper that abstracts the document serialization strategy from the document instance.

Base classes

To enable both flexibility for downstream applications and out-of-the-box utility, Docling defines a serialization class hierarchy, providing:

  • base types for the above abstractions: BaseDocSerializer, as well as BaseTextSerializer, BaseTableSerializer etc, and BaseSerializerProvider, and
  • specific subclasses for the above-mentioned base types, e.g. MarkdownDocSerializer.

You can review all methods required to define the above base classes here.

From a client perspective, the most relevant is BaseDocSerializer.serialize(), which returns the textual representation, as well as relevant metadata on which document components contributed to that serialization.

Use in DoclingDocument export methods

Docling provides predefined serializers for Markdown, HTML, and DocTags.

The respective DoclingDocument export methods (e.g. export_to_markdown()) are provided as user shorthands — internally directly instantiating and delegating to respective serializers.

Examples

For an example showcasing how to use serializers, see here.