Advanced chunking & serialization

Overview

In this notebook we show how to customize the serialization strategies that come into play during chunking.

Setup

We will work with a document that contains some picture annotations:

from docling_core.types.doc.document import DoclingDocument

SOURCE = "./data/2408.09869v3_enriched.json"

doc = DoclingDocument.load_from_json(SOURCE)

Below we define the chunker (for more details check out Hybrid Chunking):

from docling_core.transforms.chunker.hybrid_chunker import HybridChunker
from docling_core.transforms.chunker.tokenizer.base import BaseTokenizer
from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer
from transformers import AutoTokenizer

EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"

tokenizer: BaseTokenizer = HuggingFaceTokenizer(
    tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID),
)
chunker = HybridChunker(tokenizer=tokenizer)

print(f"{tokenizer.get_max_tokens()=}")

tokenizer.get_max_tokens()=256

We define some helper functions:

from typing import Iterable, Optional

from docling_core.transforms.chunker.base import BaseChunk
from docling_core.transforms.chunker.hierarchical_chunker import DocChunk
from docling_core.types.doc.labels import DocItemLabel
from rich.console import Console
from rich.panel import Panel

console = Console(
    width=200,  # for getting Markdown tables rendered nicely
)


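def find_n_th_chunk_with_label(
    chunks: Iterable[BaseChunk], n: int, label: DocItemLabel
) -> tuple[Optional[int], Optional[BaseChunk]]:
    """Return (position, chunk) for the chunk holding the n-th (0-based) item with `label`.

    Returns (None, None) if fewer than n + 1 matching items are found.
    """
    num_found = -1
    for i, chunk in enumerate(chunks):
        doc_chunk = DocChunk.model_validate(chunk)
        for it in doc_chunk.meta.doc_items:
            if it.label == label:
                num_found += 1
                if num_found == n:
                    return i, chunk
    return None, None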


def print_chunk(chunks, chunk_pos):
    chunk = chunks[chunk_pos]
    ctx_text = chunker.contextualize(chunk=chunk)
    num_tokens = tokenizer.count_tokens(text=ctx_text)
    doc_items_refs = [it.self_ref for it in chunk.meta.doc_items]
    title = f"{chunk_pos=} {num_tokens=} {doc_items_refs=}"
    console.print(Panel(ctx_text, title=title))
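
Note that the lookup helper above counts matching items, not matching chunks: if one chunk holds two items with the requested label, n=0 and n=1 both resolve to that same chunk. The following stand-alone sketch mirrors the loop with hypothetical stand-in classes (`FakeItem` and `FakeChunk` are illustrative only and not part of docling) to make this behavior visible:

```python
from dataclasses import dataclass, field
from typing import Iterable, Optional


@dataclass
class FakeItem:  # hypothetical stand-in for a document item
    label: str


@dataclass
class FakeChunk:  # hypothetical stand-in for a chunk and its items
    items: list[FakeItem] = field(default_factory=list)


def find_n_th(
    toy_iter: Iterable[FakeChunk], n: int, label: str
) -> tuple[Optional[int], Optional[FakeChunk]]:
    # Same counting logic as the notebook helper: one count per matching item.
    num_found = -1
    for pos, toy_chunk in enumerate(toy_iter):
        for it in toy_chunk.items:
            if it.label == label:
                num_found += 1
                if num_found == n:
                    return pos, toy_chunk
    return None, None


toy_chunks = [
    FakeChunk([FakeItem("text")]),
    FakeChunk([FakeItem("table"), FakeItem("table")]),  # two tables in one chunk
    FakeChunk([FakeItem("table")]),
]

first_pos, _ = find_n_th(toy_chunks, n=0, label="table")
second_pos, _ = find_n_th(toy_chunks, n=1, label="table")
third_pos, _ = find_n_th(toy_chunks, n=2, label="table")
missing_pos, missing_chunk = find_n_th(toy_chunks, n=3, label="table")
print(first_pos, second_pos, third_pos, missing_pos)
```

Because a miss yields `(None, None)`, callers that pass the position on to `print_chunk` may want to check for `None` first.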

Table serialization

Using the default strategy

Below we inspect the first chunk containing a table, using the default serialization strategy:

chunker = HybridChunker(tokenizer=tokenizer)

chunk_iter = chunker.chunk(dl_doc=doc)

chunks = list(chunk_iter)
i, chunk = find_n_th_chunk_with_label(chunks, n=0, label=DocItemLabel.TABLE)
print_chunk(
    chunks=chunks,
    chunk_pos=i,
)

Token indices sequence length is longer than the specified maximum sequence length for this model (2942 > 512). Running this sequence through the model will result in indexing errors

╭───────────────────────────────────────────────────────────────────── chunk_pos=17 num_tokens=261 doc_items_refs=['#/tables/0'] ──────────────────────────────────────────────────────────────────────╮
│ Docling Technical Report                                                                                                                                                                             │
│ 4 Performance                                                                                                                                                                                        │
│ Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We show the time-to-solution │
│ (TTS), computed throughput in pages per second, and the peak memory used (resident set size) for both the Docling-native PDF backend and for the pypdfium backend, using 4 and 16 threads.           │
│ Apple M3 Max, Thread budget. = 4. Apple M3 Max, native backend.TTS = 177 s 167 s. Apple M3 Max, native backend.Pages/s = 1.27 1.34. Apple M3 Max, native backend.Mem = 6.20 GB. Apple M3 Max,        │
│ pypdfium backend.TTS = 103 s 92 s. Apple M3 Max, pypdfium backend.Pages/s = 2.18 2.45. Apple M3 Max, pypdfium backend.Mem = 2.56 GB. (16 cores) Intel(R) Xeon E5-2690, Thread budget. = 16 4 16. (16 │
│ cores) Intel(R) Xeon E5-2690, native                                                                                                                                                                 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

INFO: As shown above, using the HybridChunker can sometimes trigger a warning from the transformers library. This is a "false alarm"; for details check here.

Configuring a different strategy

We can configure a different serialization strategy. In the example below, we specify a different table serializer that serializes tables to Markdown instead of the triplet notation used by default:

from docling_core.transforms.chunker.hierarchical_chunker import (
    ChunkingDocSerializer,
    ChunkingSerializerProvider,
)
from docling_core.transforms.serializer.markdown import MarkdownTableSerializer


class MDTableSerializerProvider(ChunkingSerializerProvider):
    def get_serializer(self, doc):
        return ChunkingDocSerializer(
            doc=doc,
            table_serializer=MarkdownTableSerializer(),  # configuring a different table serializer
        )


chunker = HybridChunker(
    tokenizer=tokenizer,
    serializer_provider=MDTableSerializerProvider(),
)

chunk_iter = chunker.chunk(dl_doc=doc)

chunks = list(chunk_iter)
i, chunk = find_n_th_chunk_with_label(chunks, n=0, label=DocItemLabel.TABLE)
print_chunk(
    chunks=chunks,
    chunk_pos=i,
)

╭───────────────────────────────────────────────────────────────────── chunk_pos=17 num_tokens=262 doc_items_refs=['#/tables/0'] ──────────────────────────────────────────────────────────────────────╮
│ Docling Technical Report                                                                                                                                                                             │
│ 4 Performance                                                                                                                                                                                        │
│ Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We show the time-to-solution │
│ (TTS), computed throughput in pages per second, and the peak memory used (resident set size) for both the Docling-native PDF backend and for the pypdfium backend, using 4 and 16 threads.           │
│                                                                                                                                                                                                      │
│ | CPU                              | Thread budget   | native backend   | native backend   | native backend   | pypdfium backend   | pypdfium backend   | pypdfium backend   |                       │
│                                                                                                                                                                                                      │
│ |----------------------------------|-----------------|------------------|------------------|------------------|-------------                                                                         │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
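
To make the difference between the two notations concrete, here is a minimal pure-Python sketch. It does not use or reproduce docling's serializers: `to_triplets` and `to_markdown` are hypothetical helpers that only imitate the shape of the two outputs shown above, applied to a few cells copied from Table 1.

```python
def to_triplets(col_headers: list[str], rows: list[list[str]]) -> str:
    """Emit one '<row header>, <column header> = <value>' statement per cell."""
    parts = []
    for row in rows:
        row_header, *values = row
        for col_header, value in zip(col_headers[1:], values):
            parts.append(f"{row_header}, {col_header} = {value}")
    return ". ".join(parts)


def to_markdown(col_headers: list[str], rows: list[list[str]]) -> str:
    """Emit a plain Markdown table: header row, separator row, data rows."""
    lines = [
        "| " + " | ".join(col_headers) + " |",
        "|" + "|".join("---" for _ in col_headers) + "|",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)


# A reduced view of Table 1 (native backend columns only).
headers = ["CPU", "TTS", "Pages/s"]
table_rows = [
    ["Apple M3 Max", "177 s 167 s", "1.27 1.34"],
    ["(16 cores) Intel(R) Xeon E5-2690", "375 s 244 s", "0.60 0.92"],
]

triplet_text = to_triplets(headers, table_rows)
markdown_text = to_markdown(headers, table_rows)
print(triplet_text)
print(markdown_text)
```

In the triplet form every statement repeats its row and column header, so a fragment cut off by the token limit still carries its own labels. The Markdown form keeps the grid layout, but a chunk truncated partway through can lose rows, as in the panel above.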

Picture serialization

Using the default strategy

Below we inspect the first chunk containing a picture.

Even when using the default strategy, we can modify the relevant parameters, e.g. which placeholder is used for pictures:

from docling_core.transforms.serializer.markdown import MarkdownParams


class ImgPlaceholderSerializerProvider(ChunkingSerializerProvider):
    def get_serializer(self, doc):
        return ChunkingDocSerializer(
            doc=doc,
            params=MarkdownParams(
                image_placeholder="<!-- image -->",
            ),
        )


chunker = HybridChunker(
    tokenizer=tokenizer,
    serializer_provider=ImgPlaceholderSerializerProvider(),
)

chunk_iter = chunker.chunk(dl_doc=doc)

chunks = list(chunk_iter)
i, chunk = find_n_th_chunk_with_label(chunks, n=0, label=DocItemLabel.PICTURE)
print_chunk(
    chunks=chunks,
    chunk_pos=i,
)

╭───────────────────────────────────────────────── chunk_pos=0 num_tokens=133 doc_items_refs=['#/pictures/0', '#/texts/2', '#/texts/3', '#/texts/4'] ──────────────────────────────────────────────────╮
│ Docling Technical Report                                                                                                                                                                             │
│ <!-- image -->                                                                                                                                                                                       │
│                                                                                                                                                                                                      │
│ In this image we can see a cartoon image of a duck holding a paper.                                                                                                                                  │
│ Version 1.0                                                                                                                                                                                          │
│ Christoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta  │
│ Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar                                                                                                  │
│ AI4K Group, IBM Research R¨ uschlikon, Switzerland                                                                                                                                                   │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Using a custom strategy

Below we define and use a custom picture serialization strategy that leverages picture annotations:

from typing import Any

from docling_core.transforms.serializer.base import (
    BaseDocSerializer,
    SerializationResult,
)
from docling_core.transforms.serializer.common import create_ser_result
from docling_core.transforms.serializer.markdown import MarkdownPictureSerializer
from docling_core.types.doc.document import PictureItem
from typing_extensions import override


class AnnotationPictureSerializer(MarkdownPictureSerializer):
    @override
    def serialize(
        self,
        *,
        item: PictureItem,
        doc_serializer: BaseDocSerializer,
        doc: DoclingDocument,
        **kwargs: Any,
    ) -> SerializationResult:
        text_parts: list[str] = []

        if item.meta is not None:
            if item.meta.classification is not None:
                main_pred = item.meta.classification.get_main_prediction()
                if main_pred is not None:
                    text_parts.append(f"Picture type: {main_pred.class_name}")

            if item.meta.molecule is not None:
                text_parts.append(f"SMILES: {item.meta.molecule.smi}")

            if item.meta.description is not None:
                text_parts.append(f"Picture description: {item.meta.description.text}")

        text_res = "\n".join(text_parts)
        text_res = doc_serializer.post_process(text=text_res)
        return create_ser_result(text=text_res, span_source=item)


class ImgAnnotationSerializerProvider(ChunkingSerializerProvider):
    def get_serializer(self, doc: DoclingDocument):
        return ChunkingDocSerializer(
            doc=doc,
            picture_serializer=AnnotationPictureSerializer(),  # configuring a different picture serializer
        )


chunker = HybridChunker(
    tokenizer=tokenizer,
    serializer_provider=ImgAnnotationSerializerProvider(),
)

chunk_iter = chunker.chunk(dl_doc=doc)

chunks = list(chunk_iter)
i, chunk = find_n_th_chunk_with_label(chunks, n=0, label=DocItemLabel.PICTURE)
print_chunk(
    chunks=chunks,
    chunk_pos=i,
)

╭───────────────────────────────────────────────── chunk_pos=0 num_tokens=144 doc_items_refs=['#/pictures/0', '#/texts/2', '#/texts/3', '#/texts/4'] ──────────────────────────────────────────────────╮
│ Docling Technical Report                                                                                                                                                                             │
│ Picture description: In this image we can see a cartoon image of a duck holding a paper.                                                                                                             │
│                                                                                                                                                                                                      │
│ In this image we can see a cartoon image of a duck holding a paper.                                                                                                                                  │
│ Version 1.0                                                                                                                                                                                          │
│ Christoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta  │
│ Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar                                                                                                  │
│ AI4K Group, IBM Research R¨ uschlikon, Switzerland                                                                                                                                                   │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Chunk expansion

In this section, we demonstrate how to expand chunks to include additional context from their containing document items or pages. This is useful when we want to ensure that chunks include complete semantic units or when we need more context for downstream tasks.

Expansion to containing DocItem

We can expand a chunk to include the full content of its containing document item. This ensures that the chunk contains the complete semantic unit (e.g., a full paragraph, section, list, or table) rather than a truncated portion.

from docling_core.transforms.chunker.chunk_expander import TreeChunkExpander

# Create a chunk expander for expanding to containing doc items
tree_expander = TreeChunkExpander()
# The expanded content is serialized with the Markdown table serializer defined earlier
serializer = MDTableSerializerProvider().get_serializer(doc=doc)

# `chunks` still holds the result of the most recent chunker run (the picture
# serialization example), which uses the default table serializer. The original
# chunk below therefore shows triplet notation, the expanded one Markdown.
# Find the first chunk that contains a table:
table_chunk_idx, table_chunk = find_n_th_chunk_with_label(
    chunks, n=0, label=DocItemLabel.TABLE
)

# Expand the chunk to include the full containing doc item (complete table)
expanded_chunk = tree_expander.expand(
    chunk=table_chunk, dl_doc=doc, serializer=serializer
)

# Compare original and expanded chunks
print("Original chunk (partial table):")
print_chunk(chunks=chunks, chunk_pos=table_chunk_idx)

print("\nExpanded chunk (complete table in containing doc item):")
ctx_text = chunker.contextualize(chunk=expanded_chunk)
num_tokens = tokenizer.count_tokens(text=ctx_text)
title = f"chunk_pos={table_chunk_idx} (expanded) {num_tokens=}"
console.print(Panel(ctx_text, title=title))

Original chunk (partial table):

╭───────────────────────────────────────────────────────────────────── chunk_pos=17 num_tokens=261 doc_items_refs=['#/tables/0'] ──────────────────────────────────────────────────────────────────────╮
│ Docling Technical Report                                                                                                                                                                             │
│ 4 Performance                                                                                                                                                                                        │
│ Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We show the time-to-solution │
│ (TTS), computed throughput in pages per second, and the peak memory used (resident set size) for both the Docling-native PDF backend and for the pypdfium backend, using 4 and 16 threads.           │
│ Apple M3 Max, Thread budget. = 4. Apple M3 Max, native backend.TTS = 177 s 167 s. Apple M3 Max, native backend.Pages/s = 1.27 1.34. Apple M3 Max, native backend.Mem = 6.20 GB. Apple M3 Max,        │
│ pypdfium backend.TTS = 103 s 92 s. Apple M3 Max, pypdfium backend.Pages/s = 2.18 2.45. Apple M3 Max, pypdfium backend.Mem = 2.56 GB. (16 cores) Intel(R) Xeon E5-2690, Thread budget. = 16 4 16. (16 │
│ cores) Intel(R) Xeon E5-2690, native                                                                                                                                                                 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Expanded chunk (complete table in containing doc item):

╭─────────────────────────────────────────────────────────────────────────────── chunk_pos=17 (expanded) num_tokens=431 ───────────────────────────────────────────────────────────────────────────────╮
│ Docling Technical Report                                                                                                                                                                             │
│ 4 Performance                                                                                                                                                                                        │
│ Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We show the time-to-solution │
│ (TTS), computed throughput in pages per second, and the peak memory used (resident set size) for both the Docling-native PDF backend and for the pypdfium backend, using 4 and 16 threads.           │
│                                                                                                                                                                                                      │
│ | CPU                              | Thread budget   | native backend   | native backend   | native backend   | pypdfium backend   | pypdfium backend   | pypdfium backend   |                       │
│ |----------------------------------|-----------------|------------------|------------------|------------------|--------------------|--------------------|--------------------|                       │
│ |                                  |                 | TTS              | Pages/s          | Mem              | TTS                | Pages/s            | Mem                |                       │
│ | Apple M3 Max                     | 4               | 177 s 167 s      | 1.27 1.34        | 6.20 GB          | 103 s 92 s         | 2.18 2.45          | 2.56 GB            |                       │
│ | (16 cores) Intel(R) Xeon E5-2690 | 16 4 16         | 375 s 244 s      | 0.60 0.92        | 6.16 GB          | 239 s 143 s        | 0.94 1.57          | 2.42 GB            |                       │
│                                                                                                                                                                                                      │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Expansion to containing page

We can also expand a chunk to include all content from its containing page. This is particularly useful when we need full page context for tasks like question answering or when working with documents where page boundaries are semantically important.

from docling_core.transforms.chunker.chunk_expander import PageChunkExpander

# Create a chunk expander for expanding to containing pages
page_expander = PageChunkExpander()

# Reuse the table chunk from the previous example
# Expand it to include all content from the containing page
expanded_chunk = page_expander.expand(
    chunk=table_chunk, dl_doc=doc, serializer=serializer
)

# Compare original and expanded chunks
print("Original chunk (partial table):")
print_chunk(chunks=chunks, chunk_pos=table_chunk_idx)

print("\nExpanded chunk (full page containing the table):")
ctx_text = chunker.contextualize(chunk=expanded_chunk)
num_tokens = tokenizer.count_tokens(text=ctx_text)
title = f"chunk_pos={table_chunk_idx} (expanded to page) {num_tokens=}"
console.print(Panel(ctx_text, title=title))

Original chunk (partial table):

╭───────────────────────────────────────────────────────────────────── chunk_pos=17 num_tokens=261 doc_items_refs=['#/tables/0'] ──────────────────────────────────────────────────────────────────────╮
│ Docling Technical Report                                                                                                                                                                             │
│ 4 Performance                                                                                                                                                                                        │
│ Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We show the time-to-solution │
│ (TTS), computed throughput in pages per second, and the peak memory used (resident set size) for both the Docling-native PDF backend and for the pypdfium backend, using 4 and 16 threads.           │
│ Apple M3 Max, Thread budget. = 4. Apple M3 Max, native backend.TTS = 177 s 167 s. Apple M3 Max, native backend.Pages/s = 1.27 1.34. Apple M3 Max, native backend.Mem = 6.20 GB. Apple M3 Max,        │
│ pypdfium backend.TTS = 103 s 92 s. Apple M3 Max, pypdfium backend.Pages/s = 2.18 2.45. Apple M3 Max, pypdfium backend.Mem = 2.56 GB. (16 cores) Intel(R) Xeon E5-2690, Thread budget. = 16 4 16. (16 │
│ cores) Intel(R) Xeon E5-2690, native                                                                                                                                                                 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Expanded chunk (full page containing the table):

╭────────────────────────────────────────────────────────────────────────── chunk_pos=17 (expanded to page) num_tokens=1209 ───────────────────────────────────────────────────────────────────────────╮
│ Docling Technical Report                                                                                                                                                                             │
│ 4 Performance                                                                                                                                                                                        │
│ torch runtimes backing the Docling pipeline. We will deliver updates on this topic at in a future version of this report.                                                                            │
│                                                                                                                                                                                                      │
│ Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We show the time-to-solution │
│ (TTS), computed throughput in pages per second, and the peak memory used (resident set size) for both the Docling-native PDF backend and for the pypdfium backend, using 4 and 16 threads.           │
│                                                                                                                                                                                                      │
│ | CPU                              | Thread budget   | native backend   | native backend   | native backend   | pypdfium backend   | pypdfium backend   | pypdfium backend   |                       │
│ |----------------------------------|-----------------|------------------|------------------|------------------|--------------------|--------------------|--------------------|                       │
│ |                                  |                 | TTS              | Pages/s          | Mem              | TTS                | Pages/s            | Mem                |                       │
│ | Apple M3 Max                     | 4               | 177 s 167 s      | 1.27 1.34        | 6.20 GB          | 103 s 92 s         | 2.18 2.45          | 2.56 GB            |                       │
│ | (16 cores) Intel(R) Xeon E5-2690 | 16 4 16         | 375 s 244 s      | 0.60 0.92        | 6.16 GB          | 239 s 143 s        | 0.94 1.57          | 2.42 GB            |                       │
│                                                                                                                                                                                                      │
│ ## 5 Applications                                                                                                                                                                                    │
│                                                                                                                                                                                                      │
│ Thanks to the high-quality, richly structured document conversion achieved by Docling, its output qualifies for numerous downstream applications. For example, Docling can provide a base for        │
│ detailed enterprise document search, passage retrieval or classification use-cases, or support knowledge extraction pipelines, allowing specific treatment of different structures in the document,  │
│ such as tables, figures, section structure or references. For popular generative AI application patterns, such as retrieval-augmented generation (RAG), we provide quackling , an open-source        │
│ package which capitalizes on Docling's feature-rich document output to enable document-native optimized vector embedding and chunking. It plugs in seamlessly with LLM frameworks such as LlamaIndex │
│ [8]. Since Docling is fast, stable and cheap to run, it also makes for an excellent choice to build document-derived datasets. With its powerful table structure recognition, it provides            │
│ significant benefit to automated knowledge-base construction [11, 10]. Docling is also integrated within the open IBM data prep kit [6], which implements scalable data transforms to build          │
│ large-scale multi-modal training datasets.                                                                                                                                                           │
│                                                                                                                                                                                                      │
│ ## 6 Future work and contributions                                                                                                                                                                   │
│                                                                                                                                                                                                      │
│ Docling is designed to allow easy extension of the model library and pipelines. In the future, we plan to extend Docling with several more models, such as a figure-classifier model, an             │
│ equationrecognition model, a code-recognition model and more. This will help improve the quality of conversion for specific types of content, as well as augment extracted document metadata with    │
│ additional information. Further investment into testing and optimizing GPU acceleration as well as improving the Docling-native PDF backend are on our roadmap, too.                                 │
│                                                                                                                                                                                                      │
│ We encourage everyone to propose or implement additional features and models, and will gladly take your inputs and contributions under review . The codebase of Docling is open for use and          │
│ contribution, under the MIT license agreement and in alignment with our contributing guidelines included in the Docling repository. If you use Docling in your projects, please consider citing this │
│ technical report.                                                                                                                                                                                    │
│                                                                                                                                                                                                      │
│ ## References                                                                                                                                                                                        │
│                                                                                                                                                                                                      │
│ - [1] J. AI. Easyocr: Ready-to-use ocr with 80+ supported languages. https://github.com/ JaidedAI/EasyOCR , 2024. Version: 1.7.0.                                                                    │
│ - [2] J. Ansel, E. Yang, H. He, N. Gimelshein, A. Jain, M. Voznesensky, B. Bao, P. Bell, D. Berard, E. Burovski, G. Chauhan, A. Chourdia, W. Constable, A. Desmaison, Z. DeVito, E. Ellison, W.      │
│ Feng, J. Gong, M. Gschwind, B. Hirsh, S. Huang, K. Kalambarkar, L. Kirsch, M. Lazos, M. Lezcano, Y. Liang, J. Liang, Y. Lu, C. Luk, B. Maher, Y. Pan, C. Puhrsch, M. Reso, M. Saroufim, M. Y.        │
│ Siraichi, H. Suk, M. Suo, P. Tillet, E. Wang, X. Wang, W. Wen, S. Zhang, X. Zhao, K. Zhou, R. Zou, A. Mathews, G. Chanan, P. Wu, and S. Chintala. Pytorch 2: Faster                                  │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
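
Expansion trades chunk size for context. As a rough illustration, the sketch below compares the token counts printed in the panels above (each measured on the contextualized text) with the limit reported by `tokenizer.get_max_tokens()`. The numbers are copied from the outputs in this notebook; whether an overshoot matters depends on the downstream embedding or generation model, which is an assumption outside the scope of this notebook.

```python
MAX_TOKENS = 256  # value printed by tokenizer.get_max_tokens() earlier

# Token counts copied from the panel titles above.
observed_tokens = {
    "original chunk": 261,
    "expanded to doc item": 431,
    "expanded to page": 1209,
}

# Positive values mean the text is longer than the tokenizer limit.
overshoot = {name: count - MAX_TOKENS for name, count in observed_tokens.items()}

for name, count in observed_tokens.items():
    ratio = count / MAX_TOKENS
    print(f"{name}: {count} tokens, {overshoot[name]:+d} vs. limit ({ratio:.1f}x)")
```

A check like this can help decide which expansion level still fits the model used downstream, for example by falling back from page-level to item-level expansion when the page is too long.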