Chunking & tokenization with Data Prep Kit

This notebook demonstrates how to build a sequence of DPK transforms that ingests HTML documents with the Docling2Parquet transform and chunks them with the Doc_Chunk transform. Both transforms are based on the Docling library.

In this example, we use the Wikimedia API to retrieve the HTML articles that serve as the seed for our LLM application. Once the articles are saved to a local cache, we construct and invoke the sequence of transforms to ingest the content, split it into chunks, and tokenize the chunked content.

🔍 Why DPK Pipelines

DPK transform pipelines are intended to simplify running any number of transforms in sequence to ingest, annotate, and filter documents and to create the embeddings used in LLM post-training and RAG applications.

🧰 Key Transforms in This Recipe

We will use the following transforms from DPK:

  • Docling2Parquet: Ingest one or more HTML documents and turn them into a parquet file.
  • Doc_Chunk: Create chunks from one or more documents.
  • Tokenization: Tokenize the document chunks.

Prerequisites

1- This notebook uses the Wikimedia API to retrieve the initial HTML documents and the llama-tokenizer from Hugging Face.

2- To use the notebook, users must provide a .env file with a valid access token for the Wikimedia endpoint (instructions can be found here) and a Hugging Face token for loading the model (instructions can be found here). The .env file will look something like this:

WIKI_ACCESS_TOKEN='eyxxx'
HF_READ_ACCESS_TOKEN='hf_xxx'

3- Install the DPK library in your environment:

%%capture
%pip install "data-prep-toolkit-transforms[docling2parquet,doc_chunk,tokenization]"
%pip install pandas
%pip install "numpy<2.0"
from dotenv import load_dotenv

load_dotenv(".env", override=True)
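A missing token only surfaces later as a failed download or a failed tokenizer load, so it can help to check the environment right after load_dotenv. The sketch below is illustrative only; missing_env_vars is a hypothetical helper, not part of DPK or python-dotenv.

```python
import os


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


# The two variable names come from the .env example above.
required = ["WIKI_ACCESS_TOKEN", "HF_READ_ACCESS_TOKEN"]
missing = missing_env_vars(required)
if missing:
    print(f"Missing values in .env: {', '.join(missing)}")
```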

We will define and use a utility function for downloading the articles and saving them to the local disk:

load_corpus: Sends an HTTP request with the Wikimedia API token to a Wikimedia endpoint and retrieves the HTML articles that will be used as a seed for our LLM application. Each article is then saved to a local cache folder for further processing.

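def load_corpus(articles: list, folder: str) -> int:
    """Download each article as HTML into `folder`; return the number of files written."""
    import os
    import re

    import requests

    headers = {"Authorization": f"Bearer {os.getenv('WIKI_ACCESS_TOKEN')}"}
    count = 0
    for article in articles:
        try:
            endpoint = f"https://api.enterprise.wikimedia.com/v2/articles/{article}"
            # A timeout keeps the notebook from hanging on a stalled connection.
            response = requests.get(endpoint, headers=headers, timeout=60)
            response.raise_for_status()
            # The response is a list; save every entry it contains.
            for entry in response.json():
                # Replace anything that is not alphanumeric or "_" to get a safe file name.
                filename = re.sub(r"[^a-zA-Z0-9_]", "_", entry["name"])
                # HTML can contain non-ASCII text, so write it explicitly as UTF-8.
                with open(os.path.join(folder, f"{filename}.html"), "w", encoding="utf-8") as f:
                    f.write(entry["article_body"]["html"])
                count += 1
        except Exception as e:
            print(f"Failed to retrieve {article}: {e}")
    return count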
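To see which file names the sanitizing regular expression in load_corpus produces, it can be applied to a title on its own. This is a small illustration; safe_filename is a hypothetical wrapper around the same re.sub call, and the sample title is only an example.

```python
import re


def safe_filename(name: str) -> str:
    """Replace every character that is not a letter, digit, or underscore with "_"."""
    return re.sub(r"[^a-zA-Z0-9_]", "_", name)


title = "Science, technology, engineering, and mathematics"
print(safe_filename(title) + ".html")
# Science__technology__engineering__and_mathematics.html
```

Different titles can map to the same file name (for example, "C++" and "C--" both become "C__"), in which case a later download overwrites an earlier one.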

🔗 Set up the experiment

DPK requires that we define a source/input folder from which the transform sequence ingests the documents and a destination/output folder where the tokenized output is stored. We also initialize the list of articles we want to use in our application.

import os
import tempfile

datafolder = tempfile.mkdtemp(dir=os.getcwd())
articles = ["Science,_technology,_engineering,_and_mathematics"]
assert load_corpus(articles, datafolder) > 0, "Failed to download any documents"
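Each transform in this notebook reads from one folder and writes to a subfolder that the next transform reads from. The sketch below lays out that chain; plan_stages is a hypothetical helper for illustration only, and the stage folder names mirror the ones used in the cells that follow.

```python
import os


def plan_stages(root, stage_names):
    """Return (stage, input_folder, output_folder) tuples, chaining each output to the next input."""
    plan = []
    current_input = root
    for name in stage_names:
        output = os.path.join(root, name)
        plan.append((name, current_input, output))
        current_input = output
    return plan


# "datafolder" is a placeholder string here; in the notebook it is the temp folder path.
for stage, src, dst in plan_stages("datafolder", ["docling2parquet", "doc_chunk", "tkn"]):
    print(f"{stage}: {src} -> {dst}")
```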

🔗 Ingest

Invoke the Docling2Parquet transform, which parses the HTML documents and converts their content to Markdown.

%%capture
from dpk_docling2parquet import Docling2Parquet, docling2parquet_contents_types

result = Docling2Parquet(
    input_folder=datafolder,
    output_folder=f"{datafolder}/docling2parquet",
    data_files_to_use=[".html"],
    docling2parquet_contents_type=docling2parquet_contents_types.MARKDOWN,  # markdown
).transform()
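Before moving on, it can be useful to confirm that the stage actually wrote parquet files. The sketch below lists them with the standard library and reads them with Pandas on request; list_parquet and read_stage are hypothetical helper names, and no particular column names are assumed for the output.

```python
from pathlib import Path


def list_parquet(folder):
    """Return the parquet files directly inside `folder`, sorted by name."""
    return sorted(Path(folder).glob("*.parquet"))


def read_stage(folder):
    """Load every parquet file of one stage into a single DataFrame."""
    import pandas as pd  # imported lazily so listing files works without Pandas

    files = list_parquet(folder)
    if not files:
        raise FileNotFoundError(f"No parquet files found in {folder}")
    return pd.concat((pd.read_parquet(f) for f in files), ignore_index=True)
```

For example, read_stage(f"{datafolder}/docling2parquet").columns shows which columns the ingest stage produced.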

🔗 Chunk

Invoke the DocChunk transform to break the Markdown content into chunks.

%%capture
from dpk_doc_chunk import DocChunk

result = DocChunk(
    input_folder=f"{datafolder}/docling2parquet",
    output_folder=f"{datafolder}/doc_chunk",
    doc_chunk_chunking_type="li_markdown",
    doc_chunk_chunk_size_tokens=128,  # default 128
    doc_chunk_chunk_overlap_tokens=30,  # default 30
).transform()
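The two token parameters describe a window size and the overlap between consecutive windows. The sketch below is a generic illustration of that idea and not DPK's or Docling's implementation; sliding_windows is a hypothetical function. With Markdown chunking the splits appear to follow the document's structure instead, since the token counts in the final output of this notebook range from 4 to 26470.

```python
def sliding_windows(tokens, size=128, overlap=30):
    """Split `tokens` into windows of at most `size` items that share `overlap` items."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # how far each new window advances
    windows = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the sequence
        start += step
    return windows


windows = sliding_windows(list(range(300)))
print([len(w) for w in windows])  # [128, 128, 104]
```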

🔗 Tokenization

Invoke the Tokenization transform to tokenize the chunks.

%%capture
from dpk_tokenization import Tokenization

Tokenization(
    input_folder=f"{datafolder}/doc_chunk",
    output_folder=f"{datafolder}/tkn",
    tkn_tokenizer="hf-internal-testing/llama-tokenizer",
    tkn_chunk_size=20_000,
).transform()
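To relate the token ids back to text, the same tokenizer can be loaded with the Hugging Face transformers library and asked to decode them. This is a minimal sketch outside the DPK pipeline; preview_tokens and load_tokenizer are hypothetical helpers, and the sketch assumes that transformers is installed and that the tokenizer can be downloaded.

```python
def preview_tokens(tokenizer, token_ids, limit=10):
    """Return the first `limit` ids together with their token pieces and decoded text."""
    head = list(token_ids)[:limit]
    return {
        "ids": head,
        "pieces": tokenizer.convert_ids_to_tokens(head),
        "text": tokenizer.decode(head, skip_special_tokens=True),
    }


def load_tokenizer(name="hf-internal-testing/llama-tokenizer"):
    """Load the tokenizer used in this notebook (needs network access on first use)."""
    from transformers import AutoTokenizer  # imported lazily; assumed to be installed

    return AutoTokenizer.from_pretrained(name)
```

For example, preview_tokens(load_tokenizer(), [1, 444, 11814, 262, 3002]) decodes the first row of the output shown at the end of this notebook.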

✅ Summary

This notebook demonstrated how to run a DPK pipeline using IBM's Data Prep Kit and the Docling library. Each transform creates one or more parquet files that users can explore to better understand what each stage of the pipeline produces. To see the output of the final stage, we use Pandas to read the final parquet files and display their content:

from pathlib import Path

import pandas as pd

parquet_files = list(Path(f"{datafolder}/tkn/").glob("*.parquet"))
pd.concat(pd.read_parquet(file) for file in parquet_files)
| | tokens | document_id | document_length | token_count |
|---|---|---|---|---|
| 0 | [1, 444, 11814, 262, 3002] | f1f5b56a78829ab2165b3bbeb94b1167e4c5583c437f1d... | 14 | 5 |
| 1 | [1, 835, 5298, 13, 13, 797, 278, 4688, 29871, ... | 402e82a9e81cc3d2494fac36bebf8bf1a2662800e5a00c... | 2100 | 655 |
| 2 | [1, 835, 5901, 21833, 13, 13, 29899, 321, 1254... | 4fb389d0f0e999c2496f137b4a7c0671e79c09cf9477e9... | 2833 | 968 |
| 3 | [1, 444, 26304, 4978, 13, 13, 14136, 1967, 666... | 3709997548d84224361a6835760b5ae48a1637e78d54a0... | 1496 | 483 |
| 4 | [1, 444, 2648, 4234] | 1e1a58ad5664d963bc207dc791825258c33337c2559f6a... | 13 | 4 |
| 5 | [1, 835, 8314, 13, 13, 1576, 9870, 315, 1038, ... | 83a63864e5ddfdd41ef0f813fb7aa3c95e04c029c32ab3... | 1340 | 442 |
| 6 | [1, 835, 7400, 13, 13, 6028, 1114, 27871, 2987... | 5e29fb4e4cf37ed4c49994620e4a00da9693bc061e82c1... | 1800 | 548 |
| 7 | [1, 835, 7551, 13, 13, 25411, 3762, 8950, 6020... | 3fc34013d93391a7504e84069190479fbc85ba7e7072cb... | 1784 | 511 |
| 8 | [1, 835, 4092, 13, 13, 13393, 884, 29901, 518,... | e8b28e20e3fc3da40b6b368e30f9c953f5218370ec2f7a... | 774 | 229 |
| 9 | [1, 3191, 18312, 13, 13, 1576, 365, 29965, 152... | 94b54fbda274536622f70442b18126f554610e8915b235... | 1076 | 263 |
| 10 | [1, 3191, 3444, 13, 13, 1576, 1024, 310, 317, ... | fef9b66567944df131851834e2fdfb42b5c668e4b08031... | 238 | 60 |
| 11 | [1, 835, 12798, 12026, 13, 13, 1254, 12665, 97... | eeb74ae3490539aa07f25987b6b2666dc907b39147e810... | 366 | 97 |
| 12 | [1, 835, 7513, 13, 13, 19302, 284, 2879, 515, ... | cc2ccd2e9f4d0a8224716109f7a6e7b30f33ff1f8c7adf... | 1395 | 402 |
| 13 | [1, 835, 20537, 423, 13, 13, 797, 20537, 423, ... | baf13788a018da24d86b630a9032eaeee54913bbbdd0d4... | 511 | 137 |
| 14 | [1, 835, 21215, 13, 13, 1254, 12665, 17800, 52... | a5b3973ab3a98d10f4ae07a004d70c6cdcfacb41fda8d7... | 1949 | 536 |
| 15 | [1, 835, 26260, 13, 13, 797, 278, 518, 4819, 2... | dfa35b16704a4dd549701a7821b6aa856f2dd5e5b69daf... | 1042 | 291 |
| 16 | [1, 835, 660, 14873, 13, 13, 797, 518, 29984, ... | a0809b265e4a011407d38cd06c7b3ce5932683a2f9c6af... | 852 | 282 |
| 17 | [1, 835, 25960, 13, 13, 1254, 12665, 338, 760,... | 85e8f3b2af3268d49e60451d3ac87b3bd281a70cf6c4b7... | 1165 | 285 |
| 18 | [1, 835, 498, 26517, 13, 13, 797, 29871, 29906... | 15c924efdbf0135de91a095237cbe831275bab67ee1371... | 1612 | 397 |
| 19 | [1, 835, 26459, 13, 13, 29911, 29641, 728, 317... | b473b50753dd07f08da05bbf776c57747ab85ba79cb081... | 435 | 145 |
| 20 | [1, 835, 3303, 3900, 13, 13, 797, 278, 3303, 3... | 841cefc910bd5d1920187b23554ee67e0e65563373e6de... | 1212 | 344 |
| 21 | [1, 3191, 3086, 9327, 10606, 13, 13, 14804, 25... | 63924939eab38ad6636495f1c5c13760014efe42b330a6... | 1592 | 416 |
| 22 | [1, 3191, 1954, 29885, 16783, 8898, 13, 13, 24... | 44288e766c343592a44f3da59ad3b57a9f26096ac13412... | 1653 | 465 |
| 23 | [1, 3191, 13151, 13, 13, 13393, 884, 29901, 51... | 40a0f6e213901d92f1a158c3e2a55ad2558eb1deaa973f... | 4418 | 1285 |
| 24 | [1, 3191, 6981, 1455, 17261, 297, 317, 4330, 2... | 5cc92a05d39ee56e9c65cdb00f55bc9dcbe8bc1647a442... | 1289 | 375 |
| 25 | [1, 3191, 402, 1581, 330, 2547, 297, 317, 4330... | 37c88bed7898d9a7406b5b0e4b1ccfaca65a732dff0c03... | 821 | 280 |
| 26 | [1, 3191, 4124, 2042, 2877, 297, 317, 4330, 29... | f144b97af462b2ab8aba5cb6d9cba0cf5f383cc710aba0... | 1093 | 297 |
| 27 | [1, 3191, 3082, 24620, 277, 20193, 512, 4812, ... | 16525e2054a7bb7543308ad4e6642bf60e66dc475a0e0a... | 2203 | 538 |
| 28 | [1, 3191, 317, 4330, 29924, 13151, 3189, 284, ... | ebb319391e1bda81edd5ec214887150044c15cfc04f42f... | 514 | 149 |
| 29 | [1, 3191, 2522, 449, 292, 13, 13, 797, 29871, ... | 882582d1f6202a4e495f67952d3a27929177745b1f575e... | 850 | 261 |
| 30 | [1, 3191, 10317, 310, 5282, 1947, 11104, 13, 1... | 311aa5c91354b6bf575682be701981ccc6569eb35fd726... | 1561 | 416 |
| 31 | [1, 3191, 24206, 13, 13, 1254, 12665, 23992, 2... | abaa73aba997ea267d9b556679c5d680810ee5baa231fa... | 384 | 139 |
| 32 | [1, 3191, 18991, 362, 13, 13, 1576, 518, 29048... | 00f85d6dffd914d89eb44dbb4caa3a1c6b2af47f5c4c96... | 878 | 247 |
| 33 | [1, 3191, 17163, 29879, 13, 13, 797, 3979, 298... | f8d901fca6dcac6c266cf2799da814c5f5b5644c3b9476... | 2321 | 682 |
| 34 | [1, 3191, 3599, 296, 6728, 13, 13, 7504, 3278,... | 8347c4988e3acde4723696fbf63a0f2c13d61e92c8fbac... | 2960 | 841 |
| 35 | [1, 3191, 28488, 322, 11104, 304, 1371, 2693, ... | c3d0c80c861ffcd422f60b78d693bb953b69dfc3c3d55f... | 222 | 81 |
| 36 | [1, 835, 18444, 13, 13, 797, 18444, 29892, 676... | 9c41677100393c4e5e3bc4bc36caee5561cb5c93546aaf... | 1143 | 288 |
| 37 | [1, 444, 10152, 13, 13, 6330, 7456, 29901, 518... | 83f0f668bac5736d5f23f750f86ebbe173c0a56e3c51b8... | 2777 | 833 |
| 38 | [1, 444, 365, 7210, 29911, 29984, 29974, 13, 1... | 24bbfff971979686cd41132b491060bdaaf357bd3bc7cf... | 2579 | 847 |
| 39 | [1, 444, 15976, 293, 1608, 13, 13, 1576, 8569,... | 1b8c147d642e4d53152e1be73223ed58e0788700d82c73... | 4700 | 1299 |
| 40 | [1, 444, 2823, 884, 13, 13, 29899, 518, 29907,... | ac3fb4073323718ea3e32e006ed67c298af9801c4a03dd... | 1310 | 443 |
| 41 | [1, 444, 28318, 13, 13, 29896, 29889, 518, 298... | 2dad03b0e2b81c47012f94be0ab730e9c8341f0311c59e... | 59373 | 26470 |
| 42 | [1, 444, 8725, 5183, 13, 13, 29899, 4699, 1522... | 07dabd1b5cfa6f8c70f97eb33c3a19189a866eae1203c7... | 2648 | 1075 |
| 43 | [1, 444, 3985, 2988, 13, 13, 29899, 8213, 4475... | ef8cc66ae18d7238680d07372859c5be061d57b955cf7d... | 5025 | 705 |
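The document_length and token_count columns can be combined into a rough characters-per-token figure for each chunk. The sketch below works on plain tuples so that it runs without Pandas; chars_per_token is a hypothetical helper, reading document_length as a character count is an assumption, and the three sample tuples are copied from rows 0, 4, and 41 of the table above.

```python
def chars_per_token(rows):
    """Return {row_label: document_length / token_count}, skipping rows with no tokens."""
    return {
        label: length / count
        for label, length, count in rows
        if count > 0
    }


# (row label, document_length, token_count) taken from the output above.
sample = [(0, 14, 5), (4, 13, 4), (41, 59373, 26470)]
for label, ratio in chars_per_token(sample).items():
    print(f"row {label}: {ratio:.2f} characters per token")
```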