Line-Based Token Chunking¶
Overview¶
The LineBasedTokenChunker is a tokenization-aware chunker that preserves line boundaries. It's particularly useful for structured content like tables, code, or logs where line boundaries are semantically important.
Key features:
- Line preservation: Keeps entire lines within a single chunk when possible
- Prefix support: Add repeated context (e.g., table headers) to each chunk
- Overflow handling: Choose between splitting lines or omitting prefix when lines are too long
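Before diving into the examples, the core idea can be sketched in a few lines of plain Python: greedily pack whole lines into chunks under a token budget, repeating a prefix at the start of each chunk. The sketch below uses whitespace splitting as a stand-in tokenizer and only illustrates the packing strategy; it is not the docling-core implementation.

```python
# Illustrative sketch of line-based packing: whole lines are packed into
# chunks under a token budget, with a prefix repeated in every chunk.
# Whitespace splitting stands in for a real tokenizer here.

def count_tokens(text: str) -> int:
    return len(text.split())

def pack_lines(lines: list[str], prefix: str, max_tokens: int) -> list[str]:
    chunks: list[str] = []
    current = prefix
    for line in lines:
        candidate = current + line
        # Keep the line in the current chunk if it fits; a line that can
        # never fit (even alone with the prefix) still gets its own chunk.
        if count_tokens(candidate) <= max_tokens or current == prefix:
            current = candidate
        else:
            chunks.append(current)
            current = prefix + line
    if current != prefix:
        chunks.append(current)
    return chunks

header = "name age\n"
rows = ["alice 30\n", "bob 25\n", "carol 35\n"]
chunks = pack_lines(rows, header, max_tokens=6)
# Two chunks, each starting with the header.
```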
Setup¶
In [1]:
%pip install -qU pip docling transformers
Note: you may need to restart the kernel to use updated packages.
In [2]:
from docling_core.transforms.chunker.line_chunker import LineBasedTokenChunker
from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer
from transformers import AutoTokenizer
Example 1: Basic Table Chunking with Prefix¶
In this example, we'll chunk a table while repeating the header in each chunk.
In [3]:
# Set up the tokenizer with a reasonable token limit
tokenizer = HuggingFaceTokenizer(
    tokenizer=AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2"),
    max_tokens=50,  # small limit to demonstrate chunking
)

# Create the chunker with a table header prefix
chunker = LineBasedTokenChunker(
    tokenizer=tokenizer,
    prefix="| Name | Age | Department |\n|------|-----|------------|\n",
    omit_prefix_on_overflow=False,  # always include the prefix (default)
)

# Sample table rows
lines = [
    "| Alice | 30 | Engineering |\n",
    "| Bob | 25 | Marketing |\n",
    "| Charlie | 35 | Sales |\n",
    "| Diana | 28 | HR |\n",
    "| Eve | 32 | Finance |\n",
]

print(f"Max tokens: {chunker.max_tokens}")
print(f"Prefix token count: {chunker.prefix_len}\n")

chunks = chunker.chunk_text(lines)
print(f"Total chunks: {len(chunks)}\n")

for i, chunk in enumerate(chunks, 1):
    print(f"=== Chunk {i} ===")
    print(chunk)
    print(f"Tokens: {tokenizer.count_tokens(chunk)}\n")
Max tokens: 50
Prefix token count: 34

Total chunks: 3

=== Chunk 1 ===
| Name | Age | Department |
|------|-----|------------|
| Alice | 30 | Engineering |
| Bob | 25 | Marketing |
Tokens: 48

=== Chunk 2 ===
| Name | Age | Department |
|------|-----|------------|
| Charlie | 35 | Sales |
| Diana | 28 | HR |
Tokens: 48

=== Chunk 3 ===
| Name | Age | Department |
|------|-----|------------|
| Eve | 32 | Finance |
Tokens: 41
Example 2: Handling Wide Tables with omit_prefix_on_overflow¶
When working with wide tables, some rows may fit within the token budget on their own but not together with the header. The omit_prefix_on_overflow parameter controls what happens to such lines: by default they are split across chunks, while setting it to True drops the prefix for those lines so they stay intact.
In [4]:
# Set up the tokenizer with a very small token limit
tokenizer = HuggingFaceTokenizer(
    tokenizer=AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2"),
    max_tokens=30,  # very small limit to force overflow
)

# A longer prefix for the chunker
prefix = (
    "| Name | Age | Department | Location |\n|------|-----|------------|----------|\n"
)
print(f"Prefix token count: {tokenizer.count_tokens(prefix)}")
print(f"Max tokens: {tokenizer.get_max_tokens()}\n")

# Sample lines - some will be too long with the prefix
lines = [
    "| Alice Johnson | 30 | Engineering | San Francisco |\n",
    "| Bob Smith | 25 | Marketing | New York |\n",
    "| Charlie Brown with a very long name | 35 | Sales Department | Los Angeles |\n",
]

# Check the token count of each line
print("Token counts:")
for i, line in enumerate(lines, 1):
    line_tokens = tokenizer.count_tokens(line)
    with_prefix = line_tokens + tokenizer.count_tokens(prefix)
    print(f"  Line {i}: {line_tokens} tokens (with prefix: {with_prefix} tokens)")
print()
Prefix token count: 47
Max tokens: 30

Token counts:
  Line 1: 11 tokens (with prefix: 58 tokens)
  Line 2: 11 tokens (with prefix: 58 tokens)
  Line 3: 17 tokens (with prefix: 64 tokens)
Without omit_prefix_on_overflow (default behavior)¶
In [5]:
chunker_no_omit = LineBasedTokenChunker(
    tokenizer=tokenizer,
    prefix=prefix,
    omit_prefix_on_overflow=False,  # default: always include the prefix
)

chunks_no_omit = chunker_no_omit.chunk_text(lines)

print("=" * 60)
print("WITHOUT omit_prefix_on_overflow (may split long lines)")
print("=" * 60)
print(f"\nTotal chunks: {len(chunks_no_omit)}\n")

for i, chunk in enumerate(chunks_no_omit, 1):
    print(f"--- Chunk {i} ---")
    print(chunk)
    print(f"Tokens: {tokenizer.count_tokens(chunk)}")
    print(f"Has prefix: {chunk.startswith(prefix)}\n")
============================================================
WITHOUT omit_prefix_on_overflow (may split long lines)
============================================================

Total chunks: 5

--- Chunk 1 ---
| Name | Age | Department | Location
Tokens: 8
Has prefix: False

--- Chunk 2 ---
| |------|-----|------------|--
Tokens: 30
Has prefix: False

--- Chunk 3 ---
--------|
Tokens: 9
Has prefix: False

--- Chunk 4 ---
| Alice Johnson | 30 | Engineering | San Francisco |
| Bob Smith | 25 | Marketing | New York |
Tokens: 22
Has prefix: False

--- Chunk 5 ---
| Charlie Brown with a very long name | 35 | Sales Department | Los Angeles |
Tokens: 17
Has prefix: False
UserWarning: Chunks prefix is too long (47 tokens) for chunk size 30. It will be split into multiple chunks and only included in the first chunk(s). Consider increasing max_tokens to accommodate the full prefix in each chunk.
With omit_prefix_on_overflow=True¶
In [6]:
chunker_with_omit = LineBasedTokenChunker(
    tokenizer=tokenizer,
    prefix=prefix,
    omit_prefix_on_overflow=True,  # omit the prefix for lines that would overflow
)

chunks_with_omit = chunker_with_omit.chunk_text(lines)

print("=" * 60)
print("WITH omit_prefix_on_overflow (keeps lines intact)")
print("=" * 60)
print(f"\nTotal chunks: {len(chunks_with_omit)}\n")

for i, chunk in enumerate(chunks_with_omit, 1):
    print(f"--- Chunk {i} ---")
    print(chunk)
    print(f"Tokens: {tokenizer.count_tokens(chunk)}")
    print(f"Has prefix: {chunk.startswith(prefix)}\n")
============================================================
WITH omit_prefix_on_overflow (keeps lines intact)
============================================================

Total chunks: 5

--- Chunk 1 ---
| Name | Age | Department | Location
Tokens: 8
Has prefix: False

--- Chunk 2 ---
| |------|-----|------------|--
Tokens: 30
Has prefix: False

--- Chunk 3 ---
--------|
Tokens: 9
Has prefix: False

--- Chunk 4 ---
| Alice Johnson | 30 | Engineering | San Francisco |
| Bob Smith | 25 | Marketing | New York |
Tokens: 22
Has prefix: False

--- Chunk 5 ---
| Charlie Brown with a very long name | 35 | Sales Department | Los Angeles |
Tokens: 17
Has prefix: False
Example 3: Chunking a DoclingDocument¶
The LineBasedTokenChunker can also be used directly with DoclingDocument objects.
In [7]:
from docling.document_converter import DocumentConverter

# Convert a document
converter = DocumentConverter()
result = converter.convert("../../tests/data/md/wiki.md")
doc = result.document

# Create the chunker
tokenizer = HuggingFaceTokenizer(
    tokenizer=AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2"),
    max_tokens=100,
)
chunker = LineBasedTokenChunker(
    tokenizer=tokenizer,
    prefix="",  # no prefix for general documents
)

# Chunk the document
chunks = list(chunker.chunk(doc))
print(f"Total chunks: {len(chunks)}\n")

# Display the first few chunks
for i, chunk in enumerate(chunks[:3], 1):
    print(f"=== Chunk {i} ===")
    print(
        f"Text: {chunk.text[:200]}..."
        if len(chunk.text) > 200
        else f"Text: {chunk.text}"
    )
    print(f"Tokens: {tokenizer.count_tokens(chunk.text)}")
    print(f"Doc items: {len(chunk.meta.doc_items)}\n")
Total chunks: 11

=== Chunk 1 ===
Text: # IBM International Business Machines Corporation (using the trademark IBM), nicknamed Big Blue, is an American multinational technology company headquartered in Armonk, New York and present in over ...
Tokens: 57
Doc items: 12

=== Chunk 2 ===
Text: IBM is the largest industrial research organization in the world, with 19 research facilities across a dozen countries, having held the record for most annual U.S. patents generated by a business for ...
Tokens: 99
Doc items: 12

=== Chunk 3 ===
Text: systems. During the 1960s and 1970s, the IBM mainframe, exemplified by the System/360, was the world's dominant computing platform, with the company producing 80 percent of computers in the U.S. and ...
Tokens: 100
Doc items: 12
Example 4: Handling Large Prefixes¶
When a prefix exceeds the max_tokens limit, it's automatically split into multiple chunks and only included at the beginning.
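The splitting itself can be pictured with a toy sketch: an oversized prefix is cut into pieces that each fit the token budget. Whitespace tokens stand in for real tokenizer tokens, and split_prefix is a hypothetical helper for illustration only, not the docling-core logic behind prefix_chunks.

```python
# Toy sketch: cut an oversized prefix into budget-sized pieces.
# Whitespace tokens stand in for real tokenizer tokens.

def split_prefix(prefix: str, max_tokens: int) -> list[str]:
    tokens = prefix.split()
    return [
        " ".join(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]

pieces = split_prefix("a b c d e f g", max_tokens=3)
# pieces -> ["a b c", "d e f", "g"]
```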
In [8]:
import warnings

# Create a very long prefix that exceeds max_tokens
tokenizer = HuggingFaceTokenizer(
    tokenizer=AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2"),
    max_tokens=25,  # small limit
)
large_prefix = (
    "This is a very long table header that contains a lot of information " * 10
)
print(f"Large prefix token count: {tokenizer.count_tokens(large_prefix)} tokens")
print(f"Max tokens allowed: {tokenizer.get_max_tokens()} tokens\n")

# Creating the chunker with the large prefix triggers a warning
with warnings.catch_warnings(record=True) as w:
    warnings.simplefilter("always")
    chunker_large = LineBasedTokenChunker(
        tokenizer=tokenizer,
        prefix=large_prefix,
    )
if w:
    print("⚠️ Warning issued:")
    print(f"  {w[0].message}\n")

print(f"Number of prefix chunks: {len(chunker_large.prefix_chunks)}")
print(f"Prefix len (for single chunk): {chunker_large.prefix_len}\n")

# Show the prefix chunks
print("Prefix chunks:")
for i, prefix_chunk in enumerate(chunker_large.prefix_chunks, 1):
    preview = prefix_chunk[:100] + "..." if len(prefix_chunk) > 100 else prefix_chunk
    print(f"  Chunk {i}: {tokenizer.count_tokens(prefix_chunk)} tokens")
    print(f"    Content: {preview}\n")

# Test chunking with the large prefix
lines = [
    "Row 1: Some data here\n",
    "Row 2: More data here\n",
    "Row 3: Even more data\n",
]
chunks_large = chunker_large.chunk_text(lines)
print(f"Total chunks (including prefix chunks): {len(chunks_large)}")
print(f"Content chunks: {len(chunks_large) - len(chunker_large.prefix_chunks)}\n")

# Display all chunks
for i, chunk in enumerate(chunks_large, 1):
    is_prefix_chunk = i <= len(chunker_large.prefix_chunks)
    chunk_type = "[PREFIX CHUNK]" if is_prefix_chunk else "[CONTENT CHUNK]"
    print(f"Chunk {i} {chunk_type}:")
    preview = chunk[:100] + "..." if len(chunk) > 100 else chunk
    print(f"  Content: {preview}")
    print(f"  Tokens: {tokenizer.count_tokens(chunk)}\n")
Large prefix token count: 130 tokens
Max tokens allowed: 25 tokens

⚠️ Warning issued:
  Chunks prefix is too long (130 tokens) for chunk size 25. It will be split into multiple chunks and only included in the first chunk(s). Consider increasing max_tokens to accommodate the full prefix in each chunk.

Number of prefix chunks: 6
Prefix len (for single chunk): 0

Prefix chunks:
  Chunk 1: 25 tokens
    Content: This is a very long table header that contains a lot of information This is a very long table heade...
  Chunk 2: 25 tokens
    Content: information This is a very long table header that contains a lot of information This is a very lon...
  Chunk 3: 25 tokens
    Content: of information This is a very long table header that contains a lot of information This is a very ...
  Chunk 4: 24 tokens
    Content: lot of information This is a very long table header that contains a lot of information This is a v...
  Chunk 5: 24 tokens
    Content: contains a lot of information This is a very long table header that contains a lot of information ...
  Chunk 6: 7 tokens
    Content: header that contains a lot of information

Total chunks (including prefix chunks): 7
Content chunks: 1

Chunk 1 [PREFIX CHUNK]:
  Content: This is a very long table header that contains a lot of information This is a very long table heade...
  Tokens: 25
Chunk 2 [PREFIX CHUNK]:
  Content: information This is a very long table header that contains a lot of information This is a very lon...
  Tokens: 25
Chunk 3 [PREFIX CHUNK]:
  Content: of information This is a very long table header that contains a lot of information This is a very ...
  Tokens: 25
Chunk 4 [PREFIX CHUNK]:
  Content: lot of information This is a very long table header that contains a lot of information This is a v...
  Tokens: 24
Chunk 5 [PREFIX CHUNK]:
  Content: contains a lot of information This is a very long table header that contains a lot of information ...
  Tokens: 24
Chunk 6 [PREFIX CHUNK]:
  Content: header that contains a lot of information
  Tokens: 7
Chunk 7 [CONTENT CHUNK]:
  Content: Row 1: Some data here
Row 2: More data here
Row 3: Even more data
  Tokens: 18
Summary¶
When to use LineBasedTokenChunker¶
- You need to preserve line boundaries (tables, code, logs)
- You want to add context (headers, metadata) to each chunk
- You're working with structured text where lines have semantic meaning
- You need fine-grained control over how lines are split
When to use omit_prefix_on_overflow=True¶
- Working with wide tables or long prefixes
- Token budget is limited
- Line integrity is more important than consistent formatting
- You can handle chunks without the prefix in downstream processing
When to use omit_prefix_on_overflow=False (default)¶
- You need the prefix in every chunk for context
- Consistent formatting is critical
- Downstream processing requires the prefix to understand the content
- Working with narrow content where the prefix doesn't cause overflow
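A quick way to decide between the two modes is to check up front whether the prefix plus the longest line fits the token budget. The helper below is a rule-of-thumb sketch using whitespace counting as a stand-in tokenizer; prefix_fits is a hypothetical helper, not part of the docling-core API.

```python
# Rule of thumb: if the prefix plus the longest line exceeds the budget,
# some chunks will overflow, so omit_prefix_on_overflow=True keeps those
# lines intact. Whitespace counting stands in for a real tokenizer.

def count_tokens(text: str) -> int:
    return len(text.split())

def prefix_fits(lines: list[str], prefix: str, max_tokens: int) -> bool:
    budget = max_tokens - count_tokens(prefix)
    return budget >= max(count_tokens(line) for line in lines)

lines = ["alice 30 engineering\n", "bob 25 marketing\n"]
prefix = "name age dept\n"
omit_on_overflow = not prefix_fits(lines, prefix, max_tokens=5)
# With a 5-token budget the 3-token prefix leaves too little room,
# so omitting the prefix on overflow is the safer choice here.
```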