Chunking & tokenization with Data Prep Kit

This notebook demonstrates how to build a sequence of DPK transforms that ingests HTML documents with the Docling2Parquet transform and chunks them with the Doc_Chunk transform. Both transforms are based on the Docling library.

In this example, we use the Wikimedia API to retrieve the HTML articles that serve as a seed for our LLM application. Once the articles are saved to a local cache, we construct and invoke the sequence of transforms to ingest the content and tokenize the chunked content.

🔍 Why DPK Pipelines

DPK transform pipelines simplify running any number of transforms in sequence to ingest, annotate, filter, and create the embeddings used for LLM post-training and RAG applications.

🧰 Key Transforms in This Recipe

We will use the following transforms from DPK:

Docling2Parquet: Ingest one or more HTML documents and turn them into a parquet file.

Doc_Chunk: Create chunks from one or more documents.

Tokenization: Convert document chunks into token IDs.

Prerequisites

1- This notebook uses the Wikimedia API to retrieve the initial HTML documents and the llama-tokenizer from Hugging Face.

2- To use the notebook, users must provide a .env file with valid access tokens: one for the Wikimedia endpoint (instructions can be found here) and a Hugging Face token for loading the tokenizer (instructions can be found here). The .env file will look something like this:

WIKI_ACCESS_TOKEN='eyxxx'
HF_READ_ACCESS_TOKEN='hf_xxx'

3- Install the DPK library in your environment

%%capture
%pip install "data-prep-toolkit-transforms[docling2parquet,doc_chunk,tokenization]"
%pip install pandas
%pip install "numpy<2.0"

from dotenv import load_dotenv
load_dotenv(".env", override=True)
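Since both tokens come from the .env file, it can help to confirm they were actually loaded before any request is made. The following is a minimal sketch, not part of DPK: the helper name `missing_tokens` is hypothetical, and the variable names are the ones shown in the .env example above.

```python
import os

# Variable names taken from the .env example; adjust if yours differ.
REQUIRED_VARS = ("WIKI_ACCESS_TOKEN", "HF_READ_ACCESS_TOKEN")


def missing_tokens(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_tokens()
if missing:
    print(f"Missing in environment: {', '.join(missing)}")
```

Running this right after `load_dotenv` surfaces a missing or misnamed token early, rather than as a failed HTTP request later in the notebook.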

We will define and use a utility function for downloading the articles and saving them to the local disk:

load_corpus: Sends an HTTP request, authenticated with the Wikimedia API token, to a Wikimedia endpoint and retrieves the HTML articles that will be used as a seed for our LLM application. Each article is then saved to a local cache folder for further processing.

def load_corpus(articles: list, folder: str) -> int:
    import os
    import re
    import requests

    # Authenticate with the token loaded from the .env file
    headers = {"Authorization": f"Bearer {os.getenv('WIKI_ACCESS_TOKEN')}"}
    count = 0
    for article in articles:
        try:
            endpoint = f"https://api.enterprise.wikimedia.com/v2/articles/{article}"
            response = requests.get(endpoint, headers=headers)
            response.raise_for_status()
            doc = response.json()
            # The response is iterated as a list of article records
            for entry in doc:
                # Replace anything that is not a letter, digit or underscore
                filename = re.sub(r"[^a-zA-Z0-9_]", "_", entry["name"])
                with open(f"{folder}/{filename}.html", "w", encoding="utf-8") as f:
                    f.write(entry["article_body"]["html"])
                count = count + 1
        except Exception as e:
            print(f"Failed to retrieve content: {e}")
    return count
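The file name for each cached article is derived from its name with a single regular expression substitution. The sketch below isolates that step; the helper name `safe_filename` is hypothetical, and the sample title is an assumption about what the API returns in the `name` field.

```python
import re


def safe_filename(title: str) -> str:
    """Replace every character that is not a letter, digit or underscore."""
    return re.sub(r"[^a-zA-Z0-9_]", "_", title)


# Assumed example title; commas and spaces each become one underscore.
print(safe_filename("Science, technology, engineering, and mathematics"))
```

Because every disallowed character maps to the same underscore, two different titles can collapse to the same file name (for example "C++" and "C--"), in which case the later download overwrites the earlier one.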

🔗 Set up the experiment

DPK requires a source/input folder from which the transform sequence ingests the documents and a destination/output folder where the tokenized output is stored. We also initialize the list of articles we want to use in our application.

import os
import tempfile

datafolder = tempfile.mkdtemp(dir=os.getcwd())
articles = ["Science,_technology,_engineering,_and_mathematics"]
assert load_corpus(articles, datafolder) > 0, "Failed to download any documents"

🔗 Ingest

Invoke the Docling2Parquet transform, which parses the HTML documents and converts their content to Markdown.

%%capture
from dpk_docling2parquet import Docling2Parquet, docling2parquet_contents_types

result = Docling2Parquet(
    input_folder=datafolder,
    output_folder=f"{datafolder}/docling2parquet",
    data_files_to_use=[".html"],
    docling2parquet_contents_type=docling2parquet_contents_types.MARKDOWN,  # markdown
).transform()

🔗 Chunk

Invoke the DocChunk transform to break the converted documents into chunks.

%%capture
from dpk_doc_chunk import DocChunk

result = DocChunk(
    input_folder=f"{datafolder}/docling2parquet",
    output_folder=f"{datafolder}/doc_chunk",
    doc_chunk_chunking_type="li_markdown",
    doc_chunk_chunk_size_tokens=128,  # default 128
    doc_chunk_chunk_overlap_tokens=30,  # default 30
).transform()
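To make the size and overlap parameters concrete, here is a simplified sketch of a fixed-size token window with overlap. It is not the DocChunk implementation: the function name `sliding_chunks` is hypothetical, and the output table at the end of this notebook contains rows with far more than 128 tokens, which suggests these two values do not bound chunk size under the "li_markdown" chunking type. The sketch only illustrates what the two numbers mean for a window-based chunker.

```python
def sliding_chunks(tokens, size=128, overlap=30):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    if not tokens:
        return []
    step = size - overlap  # how far each new window advances
    last_start = max(len(tokens) - overlap, 1)
    return [tokens[i:i + size] for i in range(0, last_start, step)]


chunks = sliding_chunks(list(range(300)))
print([len(c) for c in chunks])
```

With 300 tokens, each window starts 98 tokens after the previous one, so the last 30 tokens of one chunk are repeated as the first 30 of the next.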

🔗 Tokenization

Invoke the Tokenization transform to convert each chunk into token IDs.

%%capture
from dpk_tokenization import Tokenization

Tokenization(
    input_folder=f"{datafolder}/doc_chunk",
    output_folder=f"{datafolder}/tkn",
    tkn_tokenizer="hf-internal-testing/llama-tokenizer",
    tkn_chunk_size=20_000,
).transform()
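The three transforms above write to the docling2parquet, doc_chunk, and tkn subfolders of the data folder. As a quick sanity check, the following sketch lists the parquet files found in each stage folder. The helper name `parquet_files_by_stage` is hypothetical and uses only the standard library.

```python
from pathlib import Path

# Output subfolders used by the three transforms in this notebook.
STAGES = ("docling2parquet", "doc_chunk", "tkn")


def parquet_files_by_stage(root):
    """Map each stage name to the sorted parquet file names in its folder."""
    root = Path(root)
    result = {}
    for stage in STAGES:
        folder = root / stage
        # A stage that has not run yet simply yields an empty list.
        files = folder.glob("*.parquet") if folder.is_dir() else []
        result[stage] = sorted(p.name for p in files)
    return result
```

In the notebook, calling `parquet_files_by_stage(datafolder)` after the Tokenization cell should show at least one file per stage; an empty list points at the stage that produced no output.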

✅ Summary

This notebook demonstrated how to run a DPK pipeline using IBM's Data Prep Kit and the Docling library. Each transform creates one or more parquet files that users can explore to better understand what each stage of the pipeline produces. To see the output of the final stage, we use Pandas to read the final parquet files and display their content.

from pathlib import Path
import pandas as pd

parquet_files = list(Path(f"{datafolder}/tkn/").glob("*.parquet"))
pd.concat(pd.read_parquet(file) for file in parquet_files)

	tokens	document_id	document_length	token_count
0	[1, 444, 11814, 262, 3002]	f1f5b56a78829ab2165b3bbeb94b1167e4c5583c437f1d...	14	5
1	[1, 835, 5298, 13, 13, 797, 278, 4688, 29871, ...	402e82a9e81cc3d2494fac36bebf8bf1a2662800e5a00c...	2100	655
2	[1, 835, 5901, 21833, 13, 13, 29899, 321, 1254...	4fb389d0f0e999c2496f137b4a7c0671e79c09cf9477e9...	2833	968
3	[1, 444, 26304, 4978, 13, 13, 14136, 1967, 666...	3709997548d84224361a6835760b5ae48a1637e78d54a0...	1496	483
4	[1, 444, 2648, 4234]	1e1a58ad5664d963bc207dc791825258c33337c2559f6a...	13	4
5	[1, 835, 8314, 13, 13, 1576, 9870, 315, 1038, ...	83a63864e5ddfdd41ef0f813fb7aa3c95e04c029c32ab3...	1340	442
6	[1, 835, 7400, 13, 13, 6028, 1114, 27871, 2987...	5e29fb4e4cf37ed4c49994620e4a00da9693bc061e82c1...	1800	548
7	[1, 835, 7551, 13, 13, 25411, 3762, 8950, 6020...	3fc34013d93391a7504e84069190479fbc85ba7e7072cb...	1784	511
8	[1, 835, 4092, 13, 13, 13393, 884, 29901, 518,...	e8b28e20e3fc3da40b6b368e30f9c953f5218370ec2f7a...	774	229
9	[1, 3191, 18312, 13, 13, 1576, 365, 29965, 152...	94b54fbda274536622f70442b18126f554610e8915b235...	1076	263
10	[1, 3191, 3444, 13, 13, 1576, 1024, 310, 317, ...	fef9b66567944df131851834e2fdfb42b5c668e4b08031...	238	60
11	[1, 835, 12798, 12026, 13, 13, 1254, 12665, 97...	eeb74ae3490539aa07f25987b6b2666dc907b39147e810...	366	97
12	[1, 835, 7513, 13, 13, 19302, 284, 2879, 515, ...	cc2ccd2e9f4d0a8224716109f7a6e7b30f33ff1f8c7adf...	1395	402
13	[1, 835, 20537, 423, 13, 13, 797, 20537, 423, ...	baf13788a018da24d86b630a9032eaeee54913bbbdd0d4...	511	137
14	[1, 835, 21215, 13, 13, 1254, 12665, 17800, 52...	a5b3973ab3a98d10f4ae07a004d70c6cdcfacb41fda8d7...	1949	536
15	[1, 835, 26260, 13, 13, 797, 278, 518, 4819, 2...	dfa35b16704a4dd549701a7821b6aa856f2dd5e5b69daf...	1042	291
16	[1, 835, 660, 14873, 13, 13, 797, 518, 29984, ...	a0809b265e4a011407d38cd06c7b3ce5932683a2f9c6af...	852	282
17	[1, 835, 25960, 13, 13, 1254, 12665, 338, 760,...	85e8f3b2af3268d49e60451d3ac87b3bd281a70cf6c4b7...	1165	285
18	[1, 835, 498, 26517, 13, 13, 797, 29871, 29906...	15c924efdbf0135de91a095237cbe831275bab67ee1371...	1612	397
19	[1, 835, 26459, 13, 13, 29911, 29641, 728, 317...	b473b50753dd07f08da05bbf776c57747ab85ba79cb081...	435	145
20	[1, 835, 3303, 3900, 13, 13, 797, 278, 3303, 3...	841cefc910bd5d1920187b23554ee67e0e65563373e6de...	1212	344
21	[1, 3191, 3086, 9327, 10606, 13, 13, 14804, 25...	63924939eab38ad6636495f1c5c13760014efe42b330a6...	1592	416
22	[1, 3191, 1954, 29885, 16783, 8898, 13, 13, 24...	44288e766c343592a44f3da59ad3b57a9f26096ac13412...	1653	465
23	[1, 3191, 13151, 13, 13, 13393, 884, 29901, 51...	40a0f6e213901d92f1a158c3e2a55ad2558eb1deaa973f...	4418	1285
24	[1, 3191, 6981, 1455, 17261, 297, 317, 4330, 2...	5cc92a05d39ee56e9c65cdb00f55bc9dcbe8bc1647a442...	1289	375
25	[1, 3191, 402, 1581, 330, 2547, 297, 317, 4330...	37c88bed7898d9a7406b5b0e4b1ccfaca65a732dff0c03...	821	280
26	[1, 3191, 4124, 2042, 2877, 297, 317, 4330, 29...	f144b97af462b2ab8aba5cb6d9cba0cf5f383cc710aba0...	1093	297
27	[1, 3191, 3082, 24620, 277, 20193, 512, 4812, ...	16525e2054a7bb7543308ad4e6642bf60e66dc475a0e0a...	2203	538
28	[1, 3191, 317, 4330, 29924, 13151, 3189, 284, ...	ebb319391e1bda81edd5ec214887150044c15cfc04f42f...	514	149
29	[1, 3191, 2522, 449, 292, 13, 13, 797, 29871, ...	882582d1f6202a4e495f67952d3a27929177745b1f575e...	850	261
30	[1, 3191, 10317, 310, 5282, 1947, 11104, 13, 1...	311aa5c91354b6bf575682be701981ccc6569eb35fd726...	1561	416
31	[1, 3191, 24206, 13, 13, 1254, 12665, 23992, 2...	abaa73aba997ea267d9b556679c5d680810ee5baa231fa...	384	139
32	[1, 3191, 18991, 362, 13, 13, 1576, 518, 29048...	00f85d6dffd914d89eb44dbb4caa3a1c6b2af47f5c4c96...	878	247
33	[1, 3191, 17163, 29879, 13, 13, 797, 3979, 298...	f8d901fca6dcac6c266cf2799da814c5f5b5644c3b9476...	2321	682
34	[1, 3191, 3599, 296, 6728, 13, 13, 7504, 3278,...	8347c4988e3acde4723696fbf63a0f2c13d61e92c8fbac...	2960	841
35	[1, 3191, 28488, 322, 11104, 304, 1371, 2693, ...	c3d0c80c861ffcd422f60b78d693bb953b69dfc3c3d55f...	222	81
36	[1, 835, 18444, 13, 13, 797, 18444, 29892, 676...	9c41677100393c4e5e3bc4bc36caee5561cb5c93546aaf...	1143	288
37	[1, 444, 10152, 13, 13, 6330, 7456, 29901, 518...	83f0f668bac5736d5f23f750f86ebbe173c0a56e3c51b8...	2777	833
38	[1, 444, 365, 7210, 29911, 29984, 29974, 13, 1...	24bbfff971979686cd41132b491060bdaaf357bd3bc7cf...	2579	847
39	[1, 444, 15976, 293, 1608, 13, 13, 1576, 8569,...	1b8c147d642e4d53152e1be73223ed58e0788700d82c73...	4700	1299
40	[1, 444, 2823, 884, 13, 13, 29899, 518, 29907,...	ac3fb4073323718ea3e32e006ed67c298af9801c4a03dd...	1310	443
41	[1, 444, 28318, 13, 13, 29896, 29889, 518, 298...	2dad03b0e2b81c47012f94be0ab730e9c8341f0311c59e...	59373	26470
42	[1, 444, 8725, 5183, 13, 13, 29899, 4699, 1522...	07dabd1b5cfa6f8c70f97eb33c3a19189a866eae1203c7...	2648	1075
43	[1, 444, 3985, 2988, 13, 13, 29899, 8213, 4475...	ef8cc66ae18d7238680d07372859c5be061d57b955cf7d...	5025	705
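One way to read the table is to compare document_length with token_count for each row. The sketch below does this for four rows copied from the table above. It assumes document_length counts characters, which the source does not state, and the helper name `length_per_token` is hypothetical.

```python
# (document_length, token_count) pairs copied from rows 0, 1, 41 and 43.
rows = {0: (14, 5), 1: (2100, 655), 41: (59373, 26470), 43: (5025, 705)}


def length_per_token(document_length: int, token_count: int) -> float:
    """Average document length covered by one token."""
    return document_length / token_count


for index, (length, count) in rows.items():
    print(f"row {index}: {length_per_token(length, count):.2f} per token")
```

Rows whose ratio differs sharply from the rest, such as the references section in row 41, are worth inspecting, since link-heavy or citation-heavy text tends to tokenize differently from running prose.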