Hybrid chunking¶
Overview¶
Hybrid chunking applies tokenization-aware refinements on top of document-based hierarchical chunking.
For more details, see the docling chunking documentation.
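Conceptually, the token-aware pass refines the chunks that hierarchical chunking produces: oversized chunks are split to fit a token budget, and undersized neighbors can be merged. The sketch below illustrates only the splitting step, in deliberately simplified form; it uses whitespace word counting as a stand-in for a real tokenizer and is not the docling implementation:

```python
# Toy sketch of the tokenization-aware refinement step (NOT the docling
# implementation): a hierarchical chunker first yields structure-aligned
# chunks; a second pass splits any chunk whose token count exceeds the budget.
# Whitespace splitting stands in for real tokenization here.

def count_tokens(text: str) -> int:
    return len(text.split())

def split_oversized(chunk: str, max_tokens: int) -> list[str]:
    words = chunk.split()
    return [
        " ".join(words[i : i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]

hierarchical_chunks = [
    "short section",
    "a much longer section with many more words than the budget allows here",
]
refined = [
    piece
    for chunk in hierarchical_chunks
    for piece in split_oversized(chunk, max_tokens=6)
]
for piece in refined:
    print(count_tokens(piece), piece)
```

A real tokenizer would replace `count_tokens`, and docling additionally keeps splits aligned with the document structure rather than cutting at arbitrary word boundaries.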
Setup¶
%pip install -qU pip docling transformers
Note: you may need to restart the kernel to use updated packages.
DOC_SOURCE = "../../tests/data/md/wiki.md"
Basic usage¶
We first convert the document:
from docling.document_converter import DocumentConverter
doc = DocumentConverter().convert(source=DOC_SOURCE).document
For a basic chunking scenario, we can just instantiate a HybridChunker, which will use
the default parameters.
from docling.chunking import HybridChunker
chunker = HybridChunker()
chunk_iter = chunker.chunk(dl_doc=doc)
Token indices sequence length is longer than the specified maximum sequence length for this model (531 > 512). Running this sequence through the model will result in indexing errors
👉 NOTE: As you can see above, using the
HybridChunker can sometimes trigger a warning from the transformers library. This is a "false alarm": for details, see the docling FAQ.
Note that the text you would typically want to embed is the context-enriched one as
returned by the contextualize() method:
for i, chunk in enumerate(chunk_iter):
print(f"=== {i} ===")
print(f"chunk.text:\n{f'{chunk.text[:300]}…'!r}")
enriched_text = chunker.contextualize(chunk=chunk)
print(f"chunker.contextualize(chunk):\n{f'{enriched_text[:300]}…'!r}")
print()
=== 0 ===
chunk.text:
'International Business Machines Corporation (using the trademark IBM), nicknamed Big Blue, is an American multinational technology company headquartered in Armonk, New York and present in over 175 countries.\nIt is a publicly traded company and one of the 30 companies in the Dow Jones Industrial Aver…'
chunker.contextualize(chunk):
'IBM\nInternational Business Machines Corporation (using the trademark IBM), nicknamed Big Blue, is an American multinational technology company headquartered in Armonk, New York and present in over 175 countries.\nIt is a publicly traded company and one of the 30 companies in the Dow Jones Industrial …'

=== 1 ===
chunk.text:
'IBM debuted in the microcomputer market in 1981 with the IBM Personal Computer, — its DOS software provided by Microsoft, — which became the basis for the majority of personal computers to the present day.[12] The company later also found success in the portable space with the ThinkPad. Since the 19…'
chunker.contextualize(chunk):
'IBM\nIBM debuted in the microcomputer market in 1981 with the IBM Personal Computer, — its DOS software provided by Microsoft, — which became the basis for the majority of personal computers to the present day.[12] The company later also found success in the portable space with the ThinkPad. Since th…'

=== 2 ===
chunk.text:
'IBM originated with several technological innovations developed and commercialized in the late 19th century. Julius E. Pitrap patented the computing scale in 1885;[17] Alexander Dey invented the dial recorder (1888);[18] Herman Hollerith patented the Electric Tabulating Machine (1889);[19] and Willa…'
chunker.contextualize(chunk):
'IBM\n1910s–1950s\nIBM originated with several technological innovations developed and commercialized in the late 19th century. Julius E. Pitrap patented the computing scale in 1885;[17] Alexander Dey invented the dial recorder (1888);[18] Herman Hollerith patented the Electric Tabulating Machine (1889…'

=== 3 ===
chunk.text:
'Collectively, the companies manufactured a wide array of machinery for sale and lease, ranging from commercial scales and industrial time recorders, meat and cheese slicers, to tabulators and punched cards. Thomas J. Watson, Sr., fired from the National Cash Register Company by John Henry Patterson,…'
chunker.contextualize(chunk):
'IBM\n1910s–1950s\nCollectively, the companies manufactured a wide array of machinery for sale and lease, ranging from commercial scales and industrial time recorders, meat and cheese slicers, to tabulators and punched cards. Thomas J. Watson, Sr., fired from the National Cash Register Company by John …'

=== 4 ===
chunk.text:
'He implemented sales conventions, "generous sales incentives, a focus on customer service, an insistence on well-groomed, dark-suited salesmen and had an evangelical fervor for instilling company pride and loyalty in every worker".[25][26] His favorite slogan, "THINK", became a mantra for each compa…'
chunker.contextualize(chunk):
'IBM\n1910s–1950s\nHe implemented sales conventions, "generous sales incentives, a focus on customer service, an insistence on well-groomed, dark-suited salesmen and had an evangelical fervor for instilling company pride and loyalty in every worker".[25][26] His favorite slogan, "THINK", became a mantr…'

=== 5 ===
chunk.text:
'In 1961, IBM developed the SABRE reservation system for American Airlines and introduced the highly successful Selectric typewriter.…'
chunker.contextualize(chunk):
'IBM\n1960s–1980s\nIn 1961, IBM developed the SABRE reservation system for American Airlines and introduced the highly successful Selectric typewriter.…'
Configuring tokenization¶
For more control over the chunking, we can parametrize the tokenization as shown below.
In a RAG / retrieval context, it is important to make sure that the chunker and embedding model are using the same tokenizer.
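To see why the tokenizers must match, consider two toy counting schemes that disagree: a chunk sized to fit a budget under a coarse count can overflow under a finer, subword-style count. Both functions below are hypothetical stand-ins for illustration, not real tokenizers:

```python
# Illustration only (toy tokenizers, not real models): if the chunker counts
# tokens one way but the embedding model counts them another way, chunks that
# fit the chunker's budget can still overflow the model's context window.

def words_tokens(text: str) -> int:
    # Coarse count: one token per whitespace-separated word.
    return len(text.split())

def subword_tokens(text: str) -> int:
    # Finer count: crude proxy for subword tokenization, roughly one token
    # per 4 characters of each word (rounded up).
    return sum((len(w) + 3) // 4 for w in text.split())

text = "internationalization considerations complicate tokenization"
print(words_tokens(text), subword_tokens(text))  # the counts differ widely
```

This mismatch is exactly what produces the "false alarm" warning seen earlier: the chunker stayed within its own budget, but the model's tokenizer counted more tokens.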
👉 HuggingFace transformers tokenizers can be used as shown in the following example:
from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer
from transformers import AutoTokenizer
from docling.chunking import HybridChunker
EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
MAX_TOKENS = 64 # set to a small number for illustrative purposes
tokenizer = HuggingFaceTokenizer(
tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID),
max_tokens=MAX_TOKENS, # optional, by default derived from `tokenizer` for HF case
)
👉 Alternatively, OpenAI tokenizers can be used as shown in the example below (uncomment to use — requires installing docling-core[chunking-openai]):
# import tiktoken
# from docling_core.transforms.chunker.tokenizer.openai import OpenAITokenizer
# tokenizer = OpenAITokenizer(
# tokenizer=tiktoken.encoding_for_model("gpt-4o"),
# max_tokens=128 * 1024, # context window length required for OpenAI tokenizers
# )
We can now instantiate our chunker:
chunker = HybridChunker(
tokenizer=tokenizer,
merge_peers=True, # optional, defaults to True
)
chunk_iter = chunker.chunk(dl_doc=doc)
chunks = list(chunk_iter)
Points to notice when looking at the output chunks below:
- Where possible, we fit within the limit of 64 tokens for the metadata-enriched serialization form (see chunk 2)
- Where needed, we stop before the limit, e.g. see cases of 63 tokens where continuing would run into a mid-sentence comma (see chunk 18)
- Where possible, we merge undersized peer chunks (see chunk 0)
- "Tail" chunks trailing right after merges may still be undersized (see chunk 8)
for i, chunk in enumerate(chunks):
print(f"=== {i} ===")
txt_tokens = tokenizer.count_tokens(chunk.text)
print(f"chunk.text ({txt_tokens} tokens):\n{chunk.text!r}")
ser_txt = chunker.contextualize(chunk=chunk)
ser_tokens = tokenizer.count_tokens(ser_txt)
print(f"chunker.contextualize(chunk) ({ser_tokens} tokens):\n{ser_txt!r}")
print()
=== 0 ===
chunk.text (55 tokens):
'International Business Machines Corporation (using the trademark IBM), nicknamed Big Blue, is an American multinational technology company headquartered in Armonk, New York and present in over 175 countries.\nIt is a publicly traded company and one of the 30 companies in the Dow Jones Industrial Average.'
chunker.contextualize(chunk) (56 tokens):
'IBM\nInternational Business Machines Corporation (using the trademark IBM), nicknamed Big Blue, is an American multinational technology company headquartered in Armonk, New York and present in over 175 countries.\nIt is a publicly traded company and one of the 30 companies in the Dow Jones Industrial Average.'

=== 1 ===
chunk.text (45 tokens):
'IBM is the largest industrial research organization in the world, with 19 research facilities across a dozen countries, having held the record for most annual U.S. patents generated by a business for 29 consecutive years from 1993 to 2021.'
chunker.contextualize(chunk) (46 tokens):
'IBM\nIBM is the largest industrial research organization in the world, with 19 research facilities across a dozen countries, having held the record for most annual U.S. patents generated by a business for 29 consecutive years from 1993 to 2021.'

=== 2 ===
chunk.text (56 tokens):
'IBM was founded in 1911 as the Computing-Tabulating-Recording Company (CTR), a holding company of manufacturers of record-keeping and measuring systems. It was renamed "International Business Machines" in 1924 and soon became the leading manufacturer of punch-card tabulating systems.'
chunker.contextualize(chunk) (57 tokens):
'IBM\nIBM was founded in 1911 as the Computing-Tabulating-Recording Company (CTR), a holding company of manufacturers of record-keeping and measuring systems. It was renamed "International Business Machines" in 1924 and soon became the leading manufacturer of punch-card tabulating systems.'

=== 3 ===
chunk.text (51 tokens):
"During the 1960s and 1970s, the IBM mainframe, exemplified by the System/360, was the world's dominant computing platform, with the company producing 80 percent of computers in the U.S. and 70 percent of computers worldwide.[11]"
chunker.contextualize(chunk) (52 tokens):
"IBM\nDuring the 1960s and 1970s, the IBM mainframe, exemplified by the System/360, was the world's dominant computing platform, with the company producing 80 percent of computers in the U.S. and 70 percent of computers worldwide.[11]"

=== 4 ===
chunk.text (59 tokens):
'IBM debuted in the microcomputer market in 1981 with the IBM Personal Computer, — its DOS software provided by Microsoft, — which became the basis for the majority of personal computers to the present day.[12] The company later also found success in the portable space with the ThinkPad.'
chunker.contextualize(chunk) (60 tokens):
'IBM\nIBM debuted in the microcomputer market in 1981 with the IBM Personal Computer, — its DOS software provided by Microsoft, — which became the basis for the majority of personal computers to the present day.[12] The company later also found success in the portable space with the ThinkPad.'

=== 5 ===
chunk.text (36 tokens):
'Since the 1990s, IBM has concentrated on computer services, software, supercomputers, and scientific research; it sold its microcomputer division to Lenovo in 2005.'
chunker.contextualize(chunk) (37 tokens):
'IBM\nSince the 1990s, IBM has concentrated on computer services, software, supercomputers, and scientific research; it sold its microcomputer division to Lenovo in 2005.'

=== 6 ===
chunk.text (29 tokens):
'IBM continues to develop mainframes, and its supercomputers have consistently ranked among the most powerful in the world in the 21st century.'
chunker.contextualize(chunk) (30 tokens):
'IBM\nIBM continues to develop mainframes, and its supercomputers have consistently ranked among the most powerful in the world in the 21st century.'

=== 7 ===
chunk.text (59 tokens):
"As one of the world's oldest and largest technology companies, IBM has been responsible for several technological innovations, including the automated teller machine (ATM), dynamic random-access memory (DRAM), the floppy disk, the hard disk drive, the magnetic stripe card, the relational database,"
chunker.contextualize(chunk) (60 tokens):
"IBM\nAs one of the world's oldest and largest technology companies, IBM has been responsible for several technological innovations, including the automated teller machine (ATM), dynamic random-access memory (DRAM), the floppy disk, the hard disk drive, the magnetic stripe card, the relational database,"

=== 8 ===
chunk.text (12 tokens):
'the SQL programming language, and the UPC barcode.'
chunker.contextualize(chunk) (13 tokens):
'IBM\nthe SQL programming language, and the UPC barcode.'

=== 9 ===
chunk.text (59 tokens):
'The company has made inroads in advanced computer chips, quantum computing, artificial intelligence, and data infrastructure.[13][14][15] IBM employees and alumni have won various recognitions for their scientific research and inventions, including six Nobel Prizes and six Turing Awards.[16]'
chunker.contextualize(chunk) (60 tokens):
'IBM\nThe company has made inroads in advanced computer chips, quantum computing, artificial intelligence, and data infrastructure.[13][14][15] IBM employees and alumni have won various recognitions for their scientific research and inventions, including six Nobel Prizes and six Turing Awards.[16]'

=== 10 ===
chunk.text (19 tokens):
'IBM originated with several technological innovations developed and commercialized in the late 19th century. Julius E.'
chunker.contextualize(chunk) (23 tokens):
'IBM\n1910s–1950s\nIBM originated with several technological innovations developed and commercialized in the late 19th century. Julius E.'

=== 11 ===
chunk.text (44 tokens):
'Pitrap patented the computing scale in 1885;[17] Alexander Dey invented the dial recorder (1888);[18] Herman Hollerith patented the Electric Tabulating Machine (1889);[19]'
chunker.contextualize(chunk) (48 tokens):
'IBM\n1910s–1950s\nPitrap patented the computing scale in 1885;[17] Alexander Dey invented the dial recorder (1888);[18] Herman Hollerith patented the Electric Tabulating Machine (1889);[19]'

=== 12 ===
chunk.text (31 tokens):
"and Willard Bundy invented a time clock to record workers' arrival and departure times on a paper tape (1889).[20] On June 16,"
chunker.contextualize(chunk) (35 tokens):
"IBM\n1910s–1950s\nand Willard Bundy invented a time clock to record workers' arrival and departure times on a paper tape (1889).[20] On June 16,"

=== 13 ===
chunk.text (39 tokens):
'1911, their four companies were amalgamated in New York State by Charles Ranlett Flint forming a fifth company, the Computing-Tabulating-Recording Company (CTR) based in Endicott,'
chunker.contextualize(chunk) (43 tokens):
'IBM\n1910s–1950s\n1911, their four companies were amalgamated in New York State by Charles Ranlett Flint forming a fifth company, the Computing-Tabulating-Recording Company (CTR) based in Endicott,'

=== 14 ===
chunk.text (55 tokens):
'New York.[1][21] The five companies had 1,300 employees and offices and plants in Endicott and Binghamton, New York;\nDayton, Ohio; Detroit, Michigan; Washington, D.C.; and Toronto, Canada.[22]'
chunker.contextualize(chunk) (59 tokens):
'IBM\n1910s–1950s\nNew York.[1][21] The five companies had 1,300 employees and offices and plants in Endicott and Binghamton, New York;\nDayton, Ohio; Detroit, Michigan; Washington, D.C.; and Toronto, Canada.[22]'

=== 15 ===
chunk.text (42 tokens):
'Collectively, the companies manufactured a wide array of machinery for sale and lease, ranging from commercial scales and industrial time recorders, meat and cheese slicers, to tabulators and punched cards. Thomas J.'
chunker.contextualize(chunk) (46 tokens):
'IBM\n1910s–1950s\nCollectively, the companies manufactured a wide array of machinery for sale and lease, ranging from commercial scales and industrial time recorders, meat and cheese slicers, to tabulators and punched cards. Thomas J.'

=== 16 ===
chunk.text (50 tokens):
'Watson, Sr., fired from the National Cash Register Company by John Henry Patterson, called on Flint and, in 1914, was offered a position at CTR.[23] Watson joined CTR as general manager and then, 11 months later,'
chunker.contextualize(chunk) (54 tokens):
'IBM\n1910s–1950s\nWatson, Sr., fired from the National Cash Register Company by John Henry Patterson, called on Flint and, in 1914, was offered a position at CTR.[23] Watson joined CTR as general manager and then, 11 months later,'

=== 17 ===
chunk.text (50 tokens):
"was made President when antitrust cases relating to his time at NCR were resolved.[24] Having learned Patterson's pioneering business practices, Watson proceeded to put the stamp of NCR onto CTR's companies.[23]:\u200a105"
chunker.contextualize(chunk) (54 tokens):
"IBM\n1910s–1950s\nwas made President when antitrust cases relating to his time at NCR were resolved.[24] Having learned Patterson's pioneering business practices, Watson proceeded to put the stamp of NCR onto CTR's companies.[23]:\u200a105"

=== 18 ===
chunk.text (59 tokens):
'He implemented sales conventions, "generous sales incentives, a focus on customer service, an insistence on well-groomed, dark-suited salesmen and had an evangelical fervor for instilling company pride and loyalty in every worker".[25][26] His favorite slogan,'
chunker.contextualize(chunk) (63 tokens):
'IBM\n1910s–1950s\nHe implemented sales conventions, "generous sales incentives, a focus on customer service, an insistence on well-groomed, dark-suited salesmen and had an evangelical fervor for instilling company pride and loyalty in every worker".[25][26] His favorite slogan,'

=== 19 ===
chunk.text (49 tokens):
'"THINK", became a mantra for each company\'s employees.[25] During Watson\'s first four years, revenues reached $9 million ($158 million today) and the company\'s operations expanded to Europe, South America,'
chunker.contextualize(chunk) (53 tokens):
'IBM\n1910s–1950s\n"THINK", became a mantra for each company\'s employees.[25] During Watson\'s first four years, revenues reached $9 million ($158 million today) and the company\'s operations expanded to Europe, South America,'

=== 20 ===
chunk.text (60 tokens):
'Asia and Australia.[25] Watson never liked the clumsy hyphenated name "Computing-Tabulating-Recording Company" and chose to replace it with the more expansive title "International Business Machines" which had previously been used as the name of CTR\'s Canadian Division;[27]'
chunker.contextualize(chunk) (64 tokens):
'IBM\n1910s–1950s\nAsia and Australia.[25] Watson never liked the clumsy hyphenated name "Computing-Tabulating-Recording Company" and chose to replace it with the more expansive title "International Business Machines" which had previously been used as the name of CTR\'s Canadian Division;[27]'

=== 21 ===
chunk.text (29 tokens):
'the name was changed on February 14,\n1924.[28] By 1933, most of the subsidiaries had been merged into one company, IBM.'
chunker.contextualize(chunk) (33 tokens):
'IBM\n1910s–1950s\nthe name was changed on February 14,\n1924.[28] By 1933, most of the subsidiaries had been merged into one company, IBM.'

=== 22 ===
chunk.text (22 tokens):
'In 1961, IBM developed the SABRE reservation system for American Airlines and introduced the highly successful Selectric typewriter.'
chunker.contextualize(chunk) (26 tokens):
'IBM\n1960s–1980s\nIn 1961, IBM developed the SABRE reservation system for American Airlines and introduced the highly successful Selectric typewriter.'
Table chunking with header repetition¶
When chunking documents with tables, the HybridChunker can repeat table headers in each chunk to maintain context. This is particularly useful for wide tables where the content spans multiple chunks.
Let's demonstrate this with a CSV file containing customer data.
# Convert a CSV file with a wide table
CSV_SOURCE = "../../tests/data/csv/csv-comma.csv"
csv_result = DocumentConverter().convert(source=CSV_SOURCE)
csv_doc = csv_result.document
print(f"Document has {len(list(csv_doc.iterate_items()))} items")
print("\nFirst few lines of the CSV table:")
print(csv_doc.export_to_markdown()[:500])
Document has 1 items

First few lines of the CSV table:
| Index | Customer Id | First Name | Last Name | Company | City | Country | Phone 1 | Phone 2 | Email | Subscription Date | Website |
|---------|-----------------|--------------|-------------|---------------------------------|-------------------|----------------------------|------------------------|-----------------------|-----------------------------|-------
Now let's chunk this table with header repetition enabled. We'll use a small token limit to force the table to be split across multiple chunks.
from docling_core.transforms.chunker.hierarchical_chunker import (
ChunkingDocSerializer,
ChunkingSerializerProvider,
)
from docling_core.transforms.serializer.markdown import (
MarkdownParams,
MarkdownTableSerializer,
)
# Create a custom serializer provider that uses Markdown for tables
class MDTableSerializerProvider(ChunkingSerializerProvider):
def get_serializer(self, doc):
return ChunkingDocSerializer(
doc=doc,
table_serializer=MarkdownTableSerializer(),
params=MarkdownParams(compact_tables=True),
)
small_tokenizer = HuggingFaceTokenizer(
tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID),
max_tokens=200,
)
chunker_with_headers = HybridChunker(
tokenizer=small_tokenizer,
repeat_table_header=True, # Repeat headers in each chunk
serializer_provider=MDTableSerializerProvider(), # Use Markdown table format
)
csv_chunks = list(chunker_with_headers.chunk(csv_doc))
print(f"Total chunks created: {len(csv_chunks)}\n")
# Display the first few chunks to show header repetition
for i, chunk in enumerate(csv_chunks[:3], 1):
print(f"{'=' * 60}")
print(f"Chunk {i}:")
print(f"{'=' * 60}")
chunk_text = chunk.text
# Show first 300 characters of each chunk
preview = chunk_text[:300] + "..." if len(chunk_text) > 300 else chunk_text
print(preview)
print(f"\nTokens: {small_tokenizer.count_tokens(chunk_text)}")
print(f"Has table header: {chunk_text.startswith('|')}\n")
Total chunks created: 5

============================================================
Chunk 1:
============================================================
| Index | Customer Id | First Name | Last Name | Company | City | Country | Phone 1 | Phone 2 | Email | Subscription Date | Website |
| - | - | - | - | - | - | - | - | - | - | - | - |
| 1 | DD37Cf93aecA6Dc | Sheryl | Baxter | Rasmussen Group | East Leonard | Chile | 229.077.5154 | 397.884.0519x718 | ...

Tokens: 131
Has table header: True

============================================================
Chunk 2:
============================================================
| Index | Customer Id | First Name | Last Name | Company | City | Country | Phone 1 | Phone 2 | Email | Subscription Date | Website |
| - | - | - | - | - | - | - | - | - | - | - | - |
| 2 | 1Ef7b82A4CAAD10 | Preston | Lozano, Dr | Vega-Gentry | East Jimmychester | Djibouti | 5153435776 | 686-620-1820...

Tokens: 132
Has table header: True

============================================================
Chunk 3:
============================================================
| Index | Customer Id | First Name | Last Name | Company | City | Country | Phone 1 | Phone 2 | Email | Subscription Date | Website |
| - | - | - | - | - | - | - | - | - | - | - | - |
| 3 | 6F94879bDAfE5a6 | Roy | Berry | Murillo-Perry | Isabelborough | Antigua and Barbuda | +1-539-402-0259 | (496)97...

Tokens: 141
Has table header: True
Each chunk starts with the table header row, ensuring that every chunk maintains the context of what each column represents. This is especially important when:
- Feeding chunks to an embedding model for semantic search
- Processing chunks independently in downstream tasks
- Working with wide tables that naturally span multiple chunks
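The mechanics of header repetition can be sketched in a few lines (a toy illustration, not the docling implementation): split the data rows into groups and prepend the header and separator rows to every group, so each chunk is self-describing.

```python
# Toy sketch of table-header repetition (NOT the docling implementation):
# split data rows into fixed-size groups and prepend the header (and the
# Markdown separator row) to each group.

def chunk_table(
    header: str, sep: str, rows: list[str], rows_per_chunk: int
) -> list[str]:
    return [
        "\n".join([header, sep, *rows[i : i + rows_per_chunk]])
        for i in range(0, len(rows), rows_per_chunk)
    ]

header = "| Index | Name |"
sep = "| - | - |"
rows = ["| 1 | Sheryl |", "| 2 | Preston |", "| 3 | Roy |"]
for c in chunk_table(header, sep, rows, rows_per_chunk=2):
    print(c)
    print()
```

In docling itself the grouping is driven by the token budget rather than a fixed row count, but the self-describing structure of each chunk is the same.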
For more advanced control over header handling in wide tables, including the omit_header_on_overflow parameter, see the Line-based chunking example.