Advanced chunking & serialization
Overview
In this notebook we show how to customize the serialization strategies that come into play during chunking.
Setup
We will work with a document that contains some picture annotations:
from docling_core.types.doc.document import DoclingDocument
SOURCE = "./data/2408.09869v3_enriched.json"
doc = DoclingDocument.load_from_json(SOURCE)
Below we define the chunker (for more details check out Hybrid Chunking):
from docling_core.transforms.chunker.hybrid_chunker import HybridChunker
from docling_core.transforms.chunker.tokenizer.base import BaseTokenizer
from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer
from transformers import AutoTokenizer
EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer: BaseTokenizer = HuggingFaceTokenizer(
    tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID),
)
chunker = HybridChunker(tokenizer=tokenizer)
print(f"{tokenizer.get_max_tokens()=}")
tokenizer.get_max_tokens()=256
Next, we define some helper functions:
from typing import Iterable, Optional
from docling_core.transforms.chunker.base import BaseChunk
from docling_core.transforms.chunker.hierarchical_chunker import DocChunk
from docling_core.types.doc.labels import DocItemLabel
from rich.console import Console
from rich.panel import Panel
console = Console(
    width=200,  # for getting Markdown tables rendered nicely
)


def find_n_th_chunk_with_label(
    chunks: Iterable[BaseChunk], n: int, label: DocItemLabel
) -> tuple[Optional[int], Optional[BaseChunk]]:
    """Return (position, chunk) for the chunk holding the n-th item with `label`."""
    num_found = -1  # index of the most recently seen matching item
    for i, chunk in enumerate(chunks):
        doc_chunk = DocChunk.model_validate(chunk)
        for it in doc_chunk.meta.doc_items:
            if it.label == label:
                num_found += 1
                if num_found == n:
                    return i, chunk
    return None, None  # fewer than n + 1 matching items


def print_chunk(chunks, chunk_pos):
    chunk = chunks[chunk_pos]
    # Token count is measured on the contextualized text, not the raw chunk text
    ctx_text = chunker.contextualize(chunk=chunk)
    num_tokens = tokenizer.count_tokens(text=ctx_text)
    doc_items_refs = [it.self_ref for it in chunk.meta.doc_items]
    title = f"{chunk_pos=} {num_tokens=} {doc_items_refs=}"
    console.print(Panel(ctx_text, title=title))
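The `print_chunk` helper counts tokens on the contextualized text, i.e. the chunk text together with its heading context, rather than on the raw chunk text. The following library-free sketch illustrates why the two counts differ. The `ToyChunk` class, the newline delimiter, and the whitespace-based token count are all illustrative assumptions and not Docling's implementation:

```python
from dataclasses import dataclass, field


@dataclass
class ToyChunk:
    """Hypothetical stand-in for a chunk: body text plus heading context."""

    text: str
    headings: list[str] = field(default_factory=list)


def toy_contextualize(chunk: ToyChunk, delim: str = "\n") -> str:
    """Prefix the chunk text with its headings (assumed behaviour)."""
    return delim.join([*chunk.headings, chunk.text])


def toy_count_tokens(text: str) -> int:
    """Whitespace word count, standing in for a real tokenizer."""
    return len(text.split())


toy_chunk = ToyChunk(
    text="Table 1: Runtime characteristics of Docling.",
    headings=["Docling Technical Report", "4 Performance"],
)
ctx = toy_contextualize(toy_chunk)
print(ctx)
print(toy_count_tokens(toy_chunk.text), toy_count_tokens(ctx))
```

Because the heading context is part of what would be embedded, the contextualized count is the one to compare against the tokenizer's limit.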
Table serialization
Using the default strategy
Below we inspect the first chunk containing a table, using the default serialization strategy:
chunker = HybridChunker(tokenizer=tokenizer)
chunk_iter = chunker.chunk(dl_doc=doc)
chunks = list(chunk_iter)
i, chunk = find_n_th_chunk_with_label(chunks, n=0, label=DocItemLabel.TABLE)
print_chunk(
    chunks=chunks,
    chunk_pos=i,
)
Token indices sequence length is longer than the specified maximum sequence length for this model (2942 > 512). Running this sequence through the model will result in indexing errors
╭─ chunk_pos=17 num_tokens=261 doc_items_refs=['#/tables/0']
│ Docling Technical Report
│ 4 Performance
│ Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We show the time-to-solution (TTS), computed throughput in pages per second, and the peak memory used (resident set size) for both the Docling-native PDF backend and for the pypdfium backend, using 4 and 16 threads.
│ Apple M3 Max, Thread budget. = 4. Apple M3 Max, native backend.TTS = 177 s 167 s. Apple M3 Max, native backend.Pages/s = 1.27 1.34. Apple M3 Max, native backend.Mem = 6.20 GB. Apple M3 Max, pypdfium backend.TTS = 103 s 92 s. Apple M3 Max, pypdfium backend.Pages/s = 2.18 2.45. Apple M3 Max, pypdfium backend.Mem = 2.56 GB. (16 cores) Intel(R) Xeon E5-2690, Thread budget. = 16 4 16. (16 cores) Intel(R) Xeon E5-2690, native
╰─
HybridChunker can sometimes trigger a warning from the transformers library, like the one shown above; however, this is a "false alarm" (for details check here).
Configuring a different strategy
We can configure a different serialization strategy. In the example below, we specify a different table serializer that serializes tables to Markdown instead of the triplet notation used by default:
from docling_core.transforms.chunker.hierarchical_chunker import (
    ChunkingDocSerializer,
    ChunkingSerializerProvider,
)
from docling_core.transforms.serializer.markdown import MarkdownTableSerializer


class MDTableSerializerProvider(ChunkingSerializerProvider):
    def get_serializer(self, doc):
        return ChunkingDocSerializer(
            doc=doc,
            table_serializer=MarkdownTableSerializer(),  # configuring a different table serializer
        )


chunker = HybridChunker(
    tokenizer=tokenizer,
    serializer_provider=MDTableSerializerProvider(),
)
chunk_iter = chunker.chunk(dl_doc=doc)
chunks = list(chunk_iter)
i, chunk = find_n_th_chunk_with_label(chunks, n=0, label=DocItemLabel.TABLE)
print_chunk(
    chunks=chunks,
    chunk_pos=i,
)
╭─ chunk_pos=17 num_tokens=262 doc_items_refs=['#/tables/0']
│ Docling Technical Report
│ 4 Performance
│ Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We show the time-to-solution (TTS), computed throughput in pages per second, and the peak memory used (resident set size) for both the Docling-native PDF backend and for the pypdfium backend, using 4 and 16 threads.
│
│ | CPU | Thread budget | native backend | native backend | native backend | pypdfium backend | pypdfium backend | pypdfium backend |
│
│ |----------------------------------|-----------------|------------------|------------------|------------------|-------------
╰─
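To make the difference between the two table serializations concrete, here is a small library-free sketch that renders one toy table both ways. The functions `to_triplets` and `to_markdown` are hypothetical helpers written for this illustration (they are not Docling APIs), the table values are made up, and the triplet format is only inferred from the output shown earlier:

```python
def to_triplets(row_headers, col_headers, cells):
    """Flatten a table into 'row, column = value' statements."""
    parts = []
    for row_name, row in zip(row_headers, cells):
        for col_name, value in zip(col_headers, row):
            parts.append(f"{row_name}, {col_name} = {value}")
    return ". ".join(parts)


def to_markdown(corner, row_headers, col_headers, cells):
    """Render the same table as a Markdown grid."""
    header = [corner, *col_headers]
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "|".join("---" for _ in header) + "|",
    ]
    for row_name, row in zip(row_headers, cells):
        lines.append("| " + " | ".join([row_name, *row]) + " |")
    return "\n".join(lines)


# Made-up values, for illustration only
rows = ["system A", "system B"]
cols = ["TTS", "Mem"]
cells = [["10 s", "1 GB"], ["20 s", "2 GB"]]

triplets = to_triplets(rows, cols, cells)
markdown = to_markdown("System", rows, cols, cells)
print(triplets)
print(markdown)
```

In the triplet form every value carries its own row and column label, so a fragment cut off by the token limit stays interpretable. The Markdown form avoids repeating the labels, but rows that end up separated from the header lose their column meaning.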
Picture serialization
Using the default strategy
Below we inspect the first chunk containing a picture.
Even when using the default strategy, we can modify the relevant parameters, e.g. which placeholder is used for pictures:
from docling_core.transforms.serializer.markdown import MarkdownParams
class ImgPlaceholderSerializerProvider(ChunkingSerializerProvider):
    def get_serializer(self, doc):
        return ChunkingDocSerializer(
            doc=doc,
            params=MarkdownParams(
                image_placeholder="<!-- image -->",
            ),
        )


chunker = HybridChunker(
    tokenizer=tokenizer,
    serializer_provider=ImgPlaceholderSerializerProvider(),
)
chunk_iter = chunker.chunk(dl_doc=doc)
chunks = list(chunk_iter)
i, chunk = find_n_th_chunk_with_label(chunks, n=0, label=DocItemLabel.PICTURE)
print_chunk(
    chunks=chunks,
    chunk_pos=i,
)
╭─ chunk_pos=0 num_tokens=133 doc_items_refs=['#/pictures/0', '#/texts/2', '#/texts/3', '#/texts/4']
│ Docling Technical Report
│ <!-- image -->
│
│ In this image we can see a cartoon image of a duck holding a paper.
│ Version 1.0
│ Christoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar
│ AI4K Group, IBM Research Rüschlikon, Switzerland
╰─
Using a custom strategy
Below we define and use a custom picture serialization strategy that leverages picture annotations:
from typing import Any
from docling_core.transforms.serializer.base import (
    BaseDocSerializer,
    SerializationResult,
)
from docling_core.transforms.serializer.common import create_ser_result
from docling_core.transforms.serializer.markdown import MarkdownPictureSerializer
from docling_core.types.doc.document import (
    PictureClassificationData,
    PictureDescriptionData,
    PictureItem,
    PictureMoleculeData,
)
from typing_extensions import override


class AnnotationPictureSerializer(MarkdownPictureSerializer):
    @override
    def serialize(
        self,
        *,
        item: PictureItem,
        doc_serializer: BaseDocSerializer,
        doc: DoclingDocument,
        **kwargs: Any,
    ) -> SerializationResult:
        text_parts: list[str] = []

        # Emit one line per annotation that is present on the picture
        if item.meta is not None:
            if item.meta.classification is not None:
                main_pred = item.meta.classification.get_main_prediction()
                if main_pred is not None:
                    text_parts.append(f"Picture type: {main_pred.class_name}")
            if item.meta.molecule is not None:
                text_parts.append(f"SMILES: {item.meta.molecule.smi}")
            if item.meta.description is not None:
                text_parts.append(f"Picture description: {item.meta.description.text}")

        text_res = "\n".join(text_parts)
        text_res = doc_serializer.post_process(text=text_res)
        return create_ser_result(text=text_res, span_source=item)


class ImgAnnotationSerializerProvider(ChunkingSerializerProvider):
    def get_serializer(self, doc: DoclingDocument):
        return ChunkingDocSerializer(
            doc=doc,
            picture_serializer=AnnotationPictureSerializer(),  # configuring a different picture serializer
        )


chunker = HybridChunker(
    tokenizer=tokenizer,
    serializer_provider=ImgAnnotationSerializerProvider(),
)
chunk_iter = chunker.chunk(dl_doc=doc)
chunks = list(chunk_iter)
i, chunk = find_n_th_chunk_with_label(chunks, n=0, label=DocItemLabel.PICTURE)
print_chunk(
    chunks=chunks,
    chunk_pos=i,
)
╭─ chunk_pos=0 num_tokens=144 doc_items_refs=['#/pictures/0', '#/texts/2', '#/texts/3', '#/texts/4']
│ Docling Technical Report
│ Picture description: In this image we can see a cartoon image of a duck holding a paper.
│
│ In this image we can see a cartoon image of a duck holding a paper.
│ Version 1.0
│ Christoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar
│ AI4K Group, IBM Research Rüschlikon, Switzerland
╰─
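The core of the custom strategy is the pattern "emit one line per annotation that exists, skip the rest". The sketch below isolates that pattern without any Docling dependency. `ToyPictureMeta` and `annotation_text` are hypothetical names introduced only for this illustration, and the field names are simplified assumptions rather than the library's data model:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyPictureMeta:
    """Hypothetical stand-in for a picture's annotation metadata."""

    class_name: Optional[str] = None
    smiles: Optional[str] = None
    description: Optional[str] = None


def annotation_text(meta: Optional[ToyPictureMeta]) -> str:
    """Emit one line per annotation that is present; skip the rest."""
    if meta is None:
        return ""
    parts = []
    if meta.class_name is not None:
        parts.append(f"Picture type: {meta.class_name}")
    if meta.smiles is not None:
        parts.append(f"SMILES: {meta.smiles}")
    if meta.description is not None:
        parts.append(f"Picture description: {meta.description}")
    return "\n".join(parts)


described = annotation_text(ToyPictureMeta(description="A duck holding a paper."))
unannotated = annotation_text(None)
print(repr(described))
print(repr(unannotated))
```

A picture without annotations serializes to an empty string here, so it contributes no tokens to the chunk.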
Chunk expansion
In this section, we demonstrate how to expand chunks to include additional context from their containing document items or pages. This is useful when we want to ensure that chunks include complete semantic units or when we need more context for downstream tasks.
Expansion to containing DocItem
We can expand a chunk to include the full content of its containing document item. This ensures that the chunk contains the complete semantic unit (e.g., a full paragraph, section, list, or table) rather than a truncated portion.
from docling_core.transforms.chunker.chunk_expander import TreeChunkExpander
from docling_core.transforms.chunker.hierarchical_chunker import ChunkingDocSerializer
# Create a chunk expander for expanding to containing doc items
tree_expander = TreeChunkExpander()
serializer = MDTableSerializerProvider().get_serializer(doc=doc)
# `chunks` still holds the output of the most recent chunker, which uses the
# default (triplet) table serialization; find the first chunk containing a table
table_chunk_idx, table_chunk = find_n_th_chunk_with_label(
    chunks, n=0, label=DocItemLabel.TABLE
)
# Expand the chunk to include the full containing doc item (complete table),
# serialized with the Markdown table serializer configured above
expanded_chunk = tree_expander.expand(
    chunk=table_chunk, dl_doc=doc, serializer=serializer
)
# Compare original and expanded chunks
print("Original chunk (partial table):")
print_chunk(chunks=chunks, chunk_pos=table_chunk_idx)
print("\nExpanded chunk (complete table in containing doc item):")
ctx_text = chunker.contextualize(chunk=expanded_chunk)
num_tokens = tokenizer.count_tokens(text=ctx_text)
title = f"chunk_pos={table_chunk_idx} (expanded) {num_tokens=}"
console.print(Panel(ctx_text, title=title))
Original chunk (partial table):
╭─ chunk_pos=17 num_tokens=261 doc_items_refs=['#/tables/0']
│ Docling Technical Report
│ 4 Performance
│ Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We show the time-to-solution (TTS), computed throughput in pages per second, and the peak memory used (resident set size) for both the Docling-native PDF backend and for the pypdfium backend, using 4 and 16 threads.
│ Apple M3 Max, Thread budget. = 4. Apple M3 Max, native backend.TTS = 177 s 167 s. Apple M3 Max, native backend.Pages/s = 1.27 1.34. Apple M3 Max, native backend.Mem = 6.20 GB. Apple M3 Max, pypdfium backend.TTS = 103 s 92 s. Apple M3 Max, pypdfium backend.Pages/s = 2.18 2.45. Apple M3 Max, pypdfium backend.Mem = 2.56 GB. (16 cores) Intel(R) Xeon E5-2690, Thread budget. = 16 4 16. (16 cores) Intel(R) Xeon E5-2690, native
╰─
Expanded chunk (complete table in containing doc item):
╭─ chunk_pos=17 (expanded) num_tokens=431
│ Docling Technical Report
│ 4 Performance
│ Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We show the time-to-solution (TTS), computed throughput in pages per second, and the peak memory used (resident set size) for both the Docling-native PDF backend and for the pypdfium backend, using 4 and 16 threads.
│
│ | CPU | Thread budget | native backend | native backend | native backend | pypdfium backend | pypdfium backend | pypdfium backend |
│ |----------------------------------|-----------------|------------------|------------------|------------------|--------------------|--------------------|--------------------|
│ | | | TTS | Pages/s | Mem | TTS | Pages/s | Mem |
│ | Apple M3 Max | 4 | 177 s 167 s | 1.27 1.34 | 6.20 GB | 103 s 92 s | 2.18 2.45 | 2.56 GB |
│ | (16 cores) Intel(R) Xeon E5-2690 | 16 4 16 | 375 s 244 s | 0.60 0.92 | 6.16 GB | 239 s 143 s | 0.94 1.57 | 2.42 GB |
│
╰─
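Conceptually, expanding to the containing doc item means replacing a truncated fragment with the full serialization of every item the chunk references. The sketch below shows only that idea on toy data. `ToyChunk`, `expand_to_items`, and the `full_items` lookup are hypothetical and make no claim about how `TreeChunkExpander` is implemented:

```python
from dataclasses import dataclass


@dataclass
class ToyChunk:
    """Hypothetical chunk: possibly truncated text plus referenced item ids."""

    text: str
    item_refs: list[str]


def expand_to_items(chunk: ToyChunk, full_items: dict[str, str]) -> ToyChunk:
    """Replace the chunk text with the full text of every item it references."""
    full_text = "\n".join(full_items[ref] for ref in chunk.item_refs)
    return ToyChunk(text=full_text, item_refs=list(chunk.item_refs))


# Full serialization of each item, keyed by a reference string
full_items = {"#/tables/0": "| a | b |\n|---|---|\n| 1 | 2 |\n| 3 | 4 |"}

# A chunk that was cut off after the table header
partial = ToyChunk(text="| a | b |\n|---|---|", item_refs=["#/tables/0"])
expanded = expand_to_items(partial, full_items)
print(expanded.text)
```

The expanded chunk is a new object, so the original (token-bounded) chunk remains available for use cases where the size limit matters.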
Expansion to containing page
We can also expand a chunk to include all content from its containing page. This is particularly useful when we need full page context for tasks like question answering or when working with documents where page boundaries are semantically important.
from docling_core.transforms.chunker.chunk_expander import PageChunkExpander
# Create a chunk expander for expanding to containing pages
page_expander = PageChunkExpander()
# Reuse the table chunk from the previous example
# Expand it to include all content from the containing page
expanded_chunk = page_expander.expand(
    chunk=table_chunk, dl_doc=doc, serializer=serializer
)
# Compare original and expanded chunks
print("Original chunk (partial table):")
print_chunk(chunks=chunks, chunk_pos=table_chunk_idx)
print("\nExpanded chunk (full page containing the table):")
ctx_text = chunker.contextualize(chunk=expanded_chunk)
num_tokens = tokenizer.count_tokens(text=ctx_text)
title = f"chunk_pos={table_chunk_idx} (expanded to page) {num_tokens=}"
console.print(Panel(ctx_text, title=title))
Original chunk (partial table):
╭─ chunk_pos=17 num_tokens=261 doc_items_refs=['#/tables/0']
│ Docling Technical Report
│ 4 Performance
│ Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We show the time-to-solution (TTS), computed throughput in pages per second, and the peak memory used (resident set size) for both the Docling-native PDF backend and for the pypdfium backend, using 4 and 16 threads.
│ Apple M3 Max, Thread budget. = 4. Apple M3 Max, native backend.TTS = 177 s 167 s. Apple M3 Max, native backend.Pages/s = 1.27 1.34. Apple M3 Max, native backend.Mem = 6.20 GB. Apple M3 Max, pypdfium backend.TTS = 103 s 92 s. Apple M3 Max, pypdfium backend.Pages/s = 2.18 2.45. Apple M3 Max, pypdfium backend.Mem = 2.56 GB. (16 cores) Intel(R) Xeon E5-2690, Thread budget. = 16 4 16. (16 cores) Intel(R) Xeon E5-2690, native
╰─
Expanded chunk (full page containing the table):
╭─ chunk_pos=17 (expanded to page) num_tokens=1209
│ Docling Technical Report
│ 4 Performance
│ torch runtimes backing the Docling pipeline. We will deliver updates on this topic at in a future version of this report.
│
│ Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We show the time-to-solution (TTS), computed throughput in pages per second, and the peak memory used (resident set size) for both the Docling-native PDF backend and for the pypdfium backend, using 4 and 16 threads.
│
│ | CPU | Thread budget | native backend | native backend | native backend | pypdfium backend | pypdfium backend | pypdfium backend |
│ |----------------------------------|-----------------|------------------|------------------|------------------|--------------------|--------------------|--------------------|
│ | | | TTS | Pages/s | Mem | TTS | Pages/s | Mem |
│ | Apple M3 Max | 4 | 177 s 167 s | 1.27 1.34 | 6.20 GB | 103 s 92 s | 2.18 2.45 | 2.56 GB |
│ | (16 cores) Intel(R) Xeon E5-2690 | 16 4 16 | 375 s 244 s | 0.60 0.92 | 6.16 GB | 239 s 143 s | 0.94 1.57 | 2.42 GB |
│
│ ## 5 Applications
│
│ Thanks to the high-quality, richly structured document conversion achieved by Docling, its output qualifies for numerous downstream applications. For example, Docling can provide a base for detailed enterprise document search, passage retrieval or classification use-cases, or support knowledge extraction pipelines, allowing specific treatment of different structures in the document, such as tables, figures, section structure or references. For popular generative AI application patterns, such as retrieval-augmented generation (RAG), we provide quackling , an open-source package which capitalizes on Docling's feature-rich document output to enable document-native optimized vector embedding and chunking. It plugs in seamlessly with LLM frameworks such as LlamaIndex [8]. Since Docling is fast, stable and cheap to run, it also makes for an excellent choice to build document-derived datasets. With its powerful table structure recognition, it provides significant benefit to automated knowledge-base construction [11, 10]. Docling is also integrated within the open IBM data prep kit [6], which implements scalable data transforms to build large-scale multi-modal training datasets.
│
│ ## 6 Future work and contributions
│
│ Docling is designed to allow easy extension of the model library and pipelines. In the future, we plan to extend Docling with several more models, such as a figure-classifier model, an equationrecognition model, a code-recognition model and more. This will help improve the quality of conversion for specific types of content, as well as augment extracted document metadata with additional information. Further investment into testing and optimizing GPU acceleration as well as improving the Docling-native PDF backend are on our roadmap, too.
│
│ We encourage everyone to propose or implement additional features and models, and will gladly take your inputs and contributions under review . The codebase of Docling is open for use and contribution, under the MIT license agreement and in alignment with our contributing guidelines included in the Docling repository. If you use Docling in your projects, please consider citing this technical report.
│
│ ## References
│
│ - [1] J. AI. Easyocr: Ready-to-use ocr with 80+ supported languages. https://github.com/ JaidedAI/EasyOCR , 2024. Version: 1.7.0.
│ - [2] J. Ansel, E. Yang, H. He, N. Gimelshein, A. Jain, M. Voznesensky, B. Bao, P. Bell, D. Berard, E. Burovski, G. Chauhan, A. Chourdia, W. Constable, A. Desmaison, Z. DeVito, E. Ellison, W. Feng, J. Gong, M. Gschwind, B. Hirsh, S. Huang, K. Kalambarkar, L. Kirsch, M. Lazos, M. Lezcano, Y. Liang, J. Liang, Y. Lu, C. Luk, B. Maher, Y. Pan, C. Puhrsch, M. Reso, M. Saroufim, M. Y. Siraichi, H. Suk, M. Suo, P. Tillet, E. Wang, X. Wang, W. Wen, S. Zhang, X. Zhao, K. Zhou, R. Zou, A. Mathews, G. Chanan, P. Wu, and S. Chintala. Pytorch 2: Faster
╰─
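Page expansion can be pictured as: collect the pages touched by the chunk's items, then gather every item located on those pages in document order. The following sketch expresses that idea on toy data. `ToyItem` and `expand_to_pages` are hypothetical names, and the logic is an assumption about the general approach rather than the behaviour of `PageChunkExpander`:

```python
from dataclasses import dataclass


@dataclass
class ToyItem:
    """Hypothetical document item with a reference, a page number and text."""

    ref: str
    page_no: int
    text: str


def expand_to_pages(chunk_refs: list[str], items: list[ToyItem]) -> str:
    """Return the text of every item on any page touched by the chunk."""
    pages = {it.page_no for it in items if it.ref in chunk_refs}
    return "\n".join(it.text for it in items if it.page_no in pages)


items = [
    ToyItem("#/texts/0", 1, "Intro"),
    ToyItem("#/texts/1", 2, "Before table"),
    ToyItem("#/tables/0", 2, "Table body"),
    ToyItem("#/texts/2", 2, "After table"),
    ToyItem("#/texts/3", 3, "Next page"),
]

page_text = expand_to_pages(["#/tables/0"], items)
print(page_text)
```

Full-page context can be far larger than the original chunk: in the run above the token count grows from 261 to 1209, well past the tokenizer's reported maximum of 256, so expanded chunks are better suited as context for generation than as embedding inputs.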