Line-Based Token Chunking

Overview

The LineBasedTokenChunker is a tokenization-aware chunker that preserves line boundaries. It's particularly useful for structured content like tables, code, or logs where line boundaries are semantically important.

Key features:
- Line preservation: keeps entire lines within a single chunk when possible
- Prefix support: adds repeated context (e.g., table headers) to each chunk
- Overflow handling: lets you choose between splitting lines and omitting the prefix when lines are too long
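The idea behind line preservation and prefix repetition can be illustrated with a minimal sketch. This is not the library's implementation: `pack_lines` and `count_words` are hypothetical names, the token counter simply counts whitespace-separated words, and the sketch assumes that the prefix and every individual line fit within the limit.

```python
from typing import Callable, List


def count_words(text: str) -> int:
    """Toy token counter (assumption): one token per whitespace-separated word."""
    return len(text.split())


def pack_lines(
    lines: List[str],
    max_tokens: int,
    prefix: str = "",
    count_tokens: Callable[[str], int] = count_words,
) -> List[str]:
    """Greedily pack whole lines into chunks, repeating `prefix` in each chunk.

    Simplification: the prefix and every single line are assumed to fit
    within max_tokens, so no line ever has to be split.
    """
    chunks: List[str] = []
    current = prefix
    used = count_tokens(prefix)
    has_lines = False
    for line in lines:
        cost = count_tokens(line)
        # Start a new chunk when the next line would exceed the budget.
        if has_lines and used + cost > max_tokens:
            chunks.append(current)
            current, used, has_lines = prefix, count_tokens(prefix), False
        current += line
        used += cost
        has_lines = True
    if has_lines:
        chunks.append(current)
    return chunks


header = "name age\n"
rows = ["alice 30\n", "bob 25\n", "carol 41\n"]
packed = pack_lines(rows, max_tokens=6, prefix=header)
for chunk in packed:
    print(repr(chunk))
```

With a budget of 6 toy tokens, the header (2 tokens) leaves room for two rows per chunk, so the three rows end up in two chunks that both start with the header.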

Setup

%pip install -qU pip docling transformers

Note: you may need to restart the kernel to use updated packages.

from docling_core.transforms.chunker.line_chunker import LineBasedTokenChunker
from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer
from transformers import AutoTokenizer

Example 1: Basic Table Chunking with Prefix

In this example, we'll chunk a table while repeating the header in each chunk.

# Set up the tokenizer with a reasonable token limit
tokenizer = HuggingFaceTokenizer(
    tokenizer=AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2"),
    max_tokens=50,  # Small limit to demonstrate chunking
)

# Create chunker with table header prefix
chunker = LineBasedTokenChunker(
    tokenizer=tokenizer,
    prefix="| Name | Age | Department |\n|------|-----|------------|\n",
    omit_prefix_on_overflow=False,  # Always include prefix (default)
)

# Sample table rows
lines = [
    "| Alice | 30 | Engineering |\n",
    "| Bob | 25 | Marketing |\n",
    "| Charlie | 35 | Sales |\n",
    "| Diana | 28 | HR |\n",
    "| Eve | 32 | Finance |\n",
]

print(f"Max tokens: {chunker.max_tokens}")
print(f"Prefix token count: {chunker.prefix_len}\n")

chunks = chunker.chunk_text(lines)

print(f"Total chunks: {len(chunks)}\n")
for i, chunk in enumerate(chunks, 1):
    print(f"=== Chunk {i} ===")
    print(chunk)
    print(f"Tokens: {tokenizer.count_tokens(chunk)}\n")

Max tokens: 50
Prefix token count: 34

Total chunks: 3

=== Chunk 1 ===
| Name | Age | Department |
|------|-----|------------|
| Alice | 30 | Engineering |
| Bob | 25 | Marketing |

Tokens: 48

=== Chunk 2 ===
| Name | Age | Department |
|------|-----|------------|
| Charlie | 35 | Sales |
| Diana | 28 | HR |

Tokens: 48

=== Chunk 3 ===
| Name | Age | Department |
|------|-----|------------|
| Eve | 32 | Finance |

Tokens: 41
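The chunk count above can be checked with a little arithmetic on the reported numbers. The sketch below assumes that token counts are additive (prefix tokens plus row tokens), which matches the reported figures here but is not guaranteed for every tokenizer.

```python
import math

max_tokens = 50               # chunk limit used in this example
prefix_tokens = 34            # reported prefix token count
chunk_tokens = [48, 48, 41]   # reported token counts per chunk
rows_in_chunk = [2, 2, 1]     # table rows visible in each chunk

# Tokens spent on each row, after subtracting the repeated header.
tokens_per_row = [
    (total - prefix_tokens) / rows
    for total, rows in zip(chunk_tokens, rows_in_chunk)
]
print(tokens_per_row)

# Budget left for rows once the header is included in a chunk.
row_budget = max_tokens - prefix_tokens
rows_that_fit = row_budget // 7
expected_chunks = math.ceil(5 / rows_that_fit)
print(row_budget, rows_that_fit, expected_chunks)
```

Each row costs 7 tokens, the header leaves a budget of 16 tokens, so two rows fit per chunk and five rows need three chunks.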

Example 2: Handling Wide Tables with `omit_prefix_on_overflow`

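When working with wide tables, some rows fit within max_tokens on their own but not together with the header. The omit_prefix_on_overflow parameter controls what happens in that case: with False (the default) the prefix is always included, and with True the prefix is omitted for lines that would otherwise overflow. Note that in the run below the prefix alone (47 tokens) exceeds the 30-token limit, so it is split across the first chunks and both settings produce the same five chunks.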

# Set up the tokenizer with a very small token limit
tokenizer = HuggingFaceTokenizer(
    tokenizer=AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2"),
    max_tokens=30,  # Very small limit to force overflow
)

# Create chunker with a longer prefix
prefix = (
    "| Name | Age | Department | Location |\n|------|-----|------------|----------|\n"
)

print(f"Prefix token count: {tokenizer.count_tokens(prefix)}")
print(f"Max tokens: {tokenizer.get_max_tokens()}\n")

# Sample lines - some will be too long with prefix
lines = [
    "| Alice Johnson | 30 | Engineering | San Francisco |\n",
    "| Bob Smith | 25 | Marketing | New York |\n",
    "| Charlie Brown with a very long name | 35 | Sales Department | Los Angeles |\n",
]

# Check token counts for each line
print("Token counts:")
for i, line in enumerate(lines, 1):
    line_tokens = tokenizer.count_tokens(line)
    with_prefix = line_tokens + tokenizer.count_tokens(prefix)
    print(f"  Line {i}: {line_tokens} tokens (with prefix: {with_prefix} tokens)")
print()

Prefix token count: 47
Max tokens: 30

Token counts:
  Line 1: 11 tokens (with prefix: 58 tokens)
  Line 2: 11 tokens (with prefix: 58 tokens)
  Line 3: 17 tokens (with prefix: 64 tokens)
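The reported counts can be turned into a quick fit check. The helper below is a hypothetical illustration of the three cases described in this section, not the chunker's own decision logic, and it assumes token counts add up as in the "with prefix" column above.

```python
def classify_line(line_tokens: int, prefix_tokens: int, max_tokens: int) -> str:
    """Describe how one line relates to the token limit (illustrative only)."""
    if prefix_tokens > max_tokens:
        return "prefix alone exceeds the limit"
    if prefix_tokens + line_tokens <= max_tokens:
        return "fits with prefix"
    if line_tokens <= max_tokens:
        return "fits only without prefix"
    return "too long even without prefix"


prefix_tokens = 47           # reported prefix token count
line_tokens = [11, 11, 17]   # reported per-line token counts
max_tokens = 30

reported = [classify_line(n, prefix_tokens, max_tokens) for n in line_tokens]
print(reported)

# Smallest limit at which every line fits together with the full prefix.
min_limit = prefix_tokens + max(line_tokens)
print(min_limit)

# Hypothetical 15-token prefix (not part of the example above).
hypothetical = [classify_line(n, 15, max_tokens) for n in line_tokens]
print(hypothetical)
```

With the actual numbers, every line falls into the first case because the 47-token prefix is larger than the 30-token limit. A limit of at least 64 tokens would let every line share a chunk with the full prefix, and the hypothetical shorter prefix shows the case where only the longest line would need the prefix omitted.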

Without `omit_prefix_on_overflow` (default behavior)

chunker_no_omit = LineBasedTokenChunker(
    tokenizer=tokenizer,
    prefix=prefix,
    omit_prefix_on_overflow=False,  # Default: always include prefix
)

chunks_no_omit = chunker_no_omit.chunk_text(lines)

print("=" * 60)
print("WITHOUT omit_prefix_on_overflow (may split long lines)")
print("=" * 60)
print(f"\nTotal chunks: {len(chunks_no_omit)}\n")

for i, chunk in enumerate(chunks_no_omit, 1):
    print(f"--- Chunk {i} ---")
    print(chunk)
    print(f"Tokens: {tokenizer.count_tokens(chunk)}")
    print(f"Has prefix: {chunk.startswith(prefix)}\n")

============================================================
WITHOUT omit_prefix_on_overflow (may split long lines)
============================================================

Total chunks: 5

--- Chunk 1 ---

| Name | Age | Department | Location
Tokens: 8
Has prefix: False

--- Chunk 2 ---

 |
|------|-----|------------|--
Tokens: 30
Has prefix: False

--- Chunk 3 ---
--------|

Tokens: 9
Has prefix: False

--- Chunk 4 ---
| Alice Johnson | 30 | Engineering | San Francisco |
| Bob Smith | 25 | Marketing | New York |

Tokens: 22
Has prefix: False

--- Chunk 5 ---
| Charlie Brown with a very long name | 35 | Sales Department | Los Angeles |

Tokens: 17
Has prefix: False



/Users/anish/Desktop/Programs/docling/.venv/lib/python3.12/site-packages/docling_core/transforms/chunker/line_chunker.py:83: UserWarning: Chunks prefix is too long (47 tokens) for chunk size 30. It will be split into multiple chunks and only included in the first chunk(s). Consider increasing max_tokens to accommodate the full prefix in each chunk.
  warnings.warn(

With `omit_prefix_on_overflow=True`

chunker_with_omit = LineBasedTokenChunker(
    tokenizer=tokenizer,
    prefix=prefix,
    omit_prefix_on_overflow=True,  # Omit prefix for lines that would overflow
)

chunks_with_omit = chunker_with_omit.chunk_text(lines)

print("=" * 60)
print("WITH omit_prefix_on_overflow (keeps lines intact)")
print("=" * 60)
print(f"\nTotal chunks: {len(chunks_with_omit)}\n")

for i, chunk in enumerate(chunks_with_omit, 1):
    print(f"--- Chunk {i} ---")
    print(chunk)
    print(f"Tokens: {tokenizer.count_tokens(chunk)}")
    print(f"Has prefix: {chunk.startswith(prefix)}\n")

============================================================
WITH omit_prefix_on_overflow (keeps lines intact)
============================================================

Total chunks: 5

--- Chunk 1 ---

| Name | Age | Department | Location
Tokens: 8
Has prefix: False

--- Chunk 2 ---

 |
|------|-----|------------|--
Tokens: 30
Has prefix: False

--- Chunk 3 ---
--------|

Tokens: 9
Has prefix: False

--- Chunk 4 ---
| Alice Johnson | 30 | Engineering | San Francisco |
| Bob Smith | 25 | Marketing | New York |

Tokens: 22
Has prefix: False

--- Chunk 5 ---
| Charlie Brown with a very long name | 35 | Sales Department | Los Angeles |

Tokens: 17
Has prefix: False

Example 3: Chunking a DoclingDocument

The LineBasedTokenChunker can also be used directly with DoclingDocument objects.

from docling.document_converter import DocumentConverter

# Convert a document
converter = DocumentConverter()
result = converter.convert("../../tests/data/md/sources/wiki.md")
doc = result.document

# Create chunker
tokenizer = HuggingFaceTokenizer(
    tokenizer=AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2"),
    max_tokens=100,
)

chunker = LineBasedTokenChunker(
    tokenizer=tokenizer,
    prefix="",  # No prefix for general documents
)

# Chunk the document
chunks = list(chunker.chunk(doc))

print(f"Total chunks: {len(chunks)}\n")

# Display first few chunks
for i, chunk in enumerate(chunks[:3], 1):
    print(f"=== Chunk {i} ===")
    print(
        f"Text: {chunk.text[:200]}..."
        if len(chunk.text) > 200
        else f"Text: {chunk.text}"
    )
    print(f"Tokens: {tokenizer.count_tokens(chunk.text)}")
    print(f"Doc items: {len(chunk.meta.doc_items)}\n")

Total chunks: 11

=== Chunk 1 ===
Text: # IBM

International Business Machines Corporation (using the trademark IBM), nicknamed Big Blue, is an American multinational technology company headquartered in Armonk, New York and present in over ...
Tokens: 57
Doc items: 12

=== Chunk 2 ===
Text: IBM is the largest industrial research organization in the world, with 19 research facilities across a dozen countries, having held the record for most annual U.S. patents generated by a business for ...
Tokens: 99
Doc items: 12

=== Chunk 3 ===
Text:  systems. During the 1960s and 1970s, the IBM mainframe, exemplified by the System/360, was the world's dominant computing platform, with the company producing 80 percent of computers in the U.S. and ...
Tokens: 100
Doc items: 12

Example 4: Handling Large Prefixes

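When a prefix exceeds the max_tokens limit, the chunker issues a warning and automatically splits the prefix into multiple chunks. These prefix chunks are emitted once at the start of the output instead of being repeated in every content chunk.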

import warnings

# Create a very long prefix that exceeds max_tokens
tokenizer = HuggingFaceTokenizer(
    tokenizer=AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2"),
    max_tokens=25,  # Small limit
)

large_prefix = (
    "This is a very long table header that contains a lot of information " * 10
)

print(f"Large prefix token count: {tokenizer.count_tokens(large_prefix)} tokens")
print(f"Max tokens allowed: {tokenizer.get_max_tokens()} tokens\n")

# Create chunker with large prefix - will trigger warning
with warnings.catch_warnings(record=True) as w:
    warnings.simplefilter("always")

    chunker_large = LineBasedTokenChunker(
        tokenizer=tokenizer,
        prefix=large_prefix,
    )

    if w:
        print("⚠️  Warning issued:")
        print(f"   {w[0].message}\n")

print(f"Number of prefix chunks: {len(chunker_large.prefix_chunks)}")
print(f"Prefix len (for single chunk): {chunker_large.prefix_len}\n")

# Show the prefix chunks
print("Prefix chunks:")
for i, prefix_chunk in enumerate(chunker_large.prefix_chunks, 1):
    preview = prefix_chunk[:100] + "..." if len(prefix_chunk) > 100 else prefix_chunk
    print(f"  Chunk {i}: {tokenizer.count_tokens(prefix_chunk)} tokens")
    print(f"  Content: {preview}\n")

# Test chunking with the large prefix
lines = [
    "Row 1: Some data here\n",
    "Row 2: More data here\n",
    "Row 3: Even more data\n",
]

chunks_large = chunker_large.chunk_text(lines)

print(f"Total chunks (including prefix chunks): {len(chunks_large)}")
print(f"Content chunks: {len(chunks_large) - len(chunker_large.prefix_chunks)}\n")

# Display all chunks
for i, chunk in enumerate(chunks_large, 1):
    is_prefix_chunk = i <= len(chunker_large.prefix_chunks)
    chunk_type = "[PREFIX CHUNK]" if is_prefix_chunk else "[CONTENT CHUNK]"

    print(f"Chunk {i} {chunk_type}:")
    preview = chunk[:100] + "..." if len(chunk) > 100 else chunk
    print(f"  Content: {preview}")
    print(f"  Tokens: {tokenizer.count_tokens(chunk)}\n")

Large prefix token count: 130 tokens
Max tokens allowed: 25 tokens

⚠️  Warning issued:
   Chunks prefix is too long (130 tokens) for chunk size 25. It will be split into multiple chunks and only included in the first chunk(s). Consider increasing max_tokens to accommodate the full prefix in each chunk.

Number of prefix chunks: 6
Prefix len (for single chunk): 0

Prefix chunks:
  Chunk 1: 25 tokens
  Content: 
This is a very long table header that contains a lot of information This is a very long table heade...

  Chunk 2: 25 tokens
  Content: 
 information This is a very long table header that contains a lot of information This is a very lon...

  Chunk 3: 25 tokens
  Content: 
 of information This is a very long table header that contains a lot of information This is a very ...

  Chunk 4: 24 tokens
  Content: 
 lot of information This is a very long table header that contains a lot of information This is a v...

  Chunk 5: 24 tokens
  Content: 
 contains a lot of information This is a very long table header that contains a lot of information ...

  Chunk 6: 7 tokens
  Content:  header that contains a lot of information

Total chunks (including prefix chunks): 7
Content chunks: 1

Chunk 1 [PREFIX CHUNK]:
  Content: 
This is a very long table header that contains a lot of information This is a very long table heade...
  Tokens: 25

Chunk 2 [PREFIX CHUNK]:
  Content: 
 information This is a very long table header that contains a lot of information This is a very lon...
  Tokens: 25

Chunk 3 [PREFIX CHUNK]:
  Content: 
 of information This is a very long table header that contains a lot of information This is a very ...
  Tokens: 25

Chunk 4 [PREFIX CHUNK]:
  Content: 
 lot of information This is a very long table header that contains a lot of information This is a v...
  Tokens: 24

Chunk 5 [PREFIX CHUNK]:
  Content: 
 contains a lot of information This is a very long table header that contains a lot of information ...
  Tokens: 24

Chunk 6 [PREFIX CHUNK]:
  Content:  header that contains a lot of information 
  Tokens: 7

Chunk 7 [CONTENT CHUNK]:
  Content: Row 1: Some data here
Row 2: More data here
Row 3: Even more data

  Tokens: 18
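The number of prefix chunks follows from the sizes involved: 130 prefix tokens at 25 tokens per chunk need at least six chunks. The sketch below reproduces that count with a toy word-level tokenizer and a hypothetical `split_tokens` helper. It is an assumption-laden illustration, and the piece sizes differ from the library's output above (25, 25, 25, 24, 24, 7), which splits on its own token boundaries.

```python
import math
from typing import List


def split_tokens(tokens: List[str], max_tokens: int) -> List[List[str]]:
    """Cut a token list into consecutive pieces of at most max_tokens tokens."""
    return [tokens[i : i + max_tokens] for i in range(0, len(tokens), max_tokens)]


large_prefix = (
    "This is a very long table header that contains a lot of information " * 10
)
# Toy tokenization (assumption): one token per whitespace-separated word.
words = large_prefix.split()

pieces = split_tokens(words, max_tokens=25)
piece_sizes = [len(piece) for piece in pieces]
min_chunks = math.ceil(len(words) / 25)

print(len(words), piece_sizes, min_chunks)
```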

Summary

When to use `LineBasedTokenChunker`

- You need to preserve line boundaries (tables, code, logs)
- You want to add context (headers, metadata) to each chunk
- You're working with structured text where lines have semantic meaning
- You need fine-grained control over how lines are split

When to use `omit_prefix_on_overflow=True`

- You're working with wide tables or long prefixes
- Your token budget is limited
- Line integrity is more important than consistent formatting
- You can handle chunks without the prefix in downstream processing
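Handling chunks without the prefix downstream can be as simple as recording which chunks carry it, mirroring the `chunk.startswith(prefix)` check used in Example 2. The following is a minimal sketch on plain strings. `tag_chunks` and the sample data are hypothetical and not part of the library.

```python
from typing import Dict, List


def tag_chunks(chunks: List[str], prefix: str) -> List[Dict[str, object]]:
    """Record, for each chunk, whether it begins with the shared prefix."""
    return [{"text": c, "has_prefix": c.startswith(prefix)} for c in chunks]


header = "| Name | Age |\n|------|-----|\n"
# Made-up chunks: one with the header, one where it was omitted.
sample_chunks = [
    header + "| Alice | 30 |\n",
    "| A row too wide to share a chunk with the header | 35 |\n",
]

tagged = tag_chunks(sample_chunks, header)
for item in tagged:
    print(item["has_prefix"])
```

A consumer can then decide per chunk whether to re-attach the header, look it up separately, or process the row as-is.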

When to use `omit_prefix_on_overflow=False` (default)

- You need the prefix in every chunk for context
- Consistent formatting is critical
- Downstream processing requires the prefix to understand the content
- You're working with narrow content where the prefix doesn't cause overflow
