🛡️ Chunking and tokenizing HTML documents using Data Prep Kit and the Docling Transforms¶
This notebook demonstrates how to build a sequence of DPK transforms for ingesting HTML documents using the Docling2Parquet transform and chunking them using the Doc_Chunk transform. Both transforms are based on the Docling library.
In this example, we will use the Wikimedia API to retrieve the HTML articles that will serve as the seed for our LLM application. Once the articles are loaded into a local cache, we will construct and invoke the sequence of transforms to ingest the content and produce tokens for the chunked content.
🔍 Why DPK Pipelines¶
DPK transform pipelines are intended to simplify how any number of transforms can be executed in sequence to ingest, annotate, filter, and create embeddings used for LLM post-training and RAG applications.
🧰 Key Transforms in This Recipe¶
We will use the following transforms from DPK:
- Docling2Parquet: Ingests one or more HTML documents and turns them into a parquet file.
- Doc_Chunk: Creates chunks from one or more documents.
- Tokenization: Creates token IDs for each document chunk.
Prerequisites¶
1- This notebook uses the Wikimedia API for retrieving the initial HTML documents and the llama-tokenizer from Hugging Face.
2- In order to use the notebook, users must provide a .env file with a valid access token for the Wikimedia endpoint (instructions can be found here) and a Hugging Face token for loading the tokenizer (instructions can be found here). The .env file will look something like this:
WIKI_ACCESS_TOKEN='eyxxx'
HF_READ_ACCESS_TOKEN='hf_xxx'
3- Install the DPK library into the environment
%%capture
%pip install "data-prep-toolkit-transforms[docling2parquet,doc_chunk,tokenization]"
%pip install pandas
%pip install "numpy<2.0"
from dotenv import load_dotenv
load_dotenv(".env", override=True)
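As a quick, optional sanity check, we can verify that both tokens were actually picked up from the .env file before making any API calls. This is a minimal sketch that only assumes the variable names shown above:
import os
# fail early if either token is missing or empty
for var in ("WIKI_ACCESS_TOKEN", "HF_READ_ACCESS_TOKEN"):
    assert os.getenv(var), f"{var} is not set; check your .env file"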
We will define and use a utility function for downloading the articles and saving them to local disk:
load_corpus: Uses an HTTP request with the Wikimedia API token to connect to a Wikimedia endpoint and retrieve the HTML articles that will serve as the seed for our LLM application. The articles are then saved to a local cache folder for further processing.
def load_corpus(articles: list, folder: str) -> int:
    import os
    import re

    import requests

    headers = {"Authorization": f"Bearer {os.getenv('WIKI_ACCESS_TOKEN')}"}
    count = 0
    for article in articles:
        try:
            endpoint = f"https://api.enterprise.wikimedia.com/v2/articles/{article}"
            response = requests.get(endpoint, headers=headers)
            response.raise_for_status()
            # The endpoint returns a list of article objects; save each one
            # as an HTML file named after a sanitized version of the article name
            for item in response.json():
                filename = re.sub(r"[^a-zA-Z0-9_]", "_", item["name"])
                with open(f"{folder}/{filename}.html", "w") as f:
                    f.write(item["article_body"]["html"])
                count += 1
        except Exception as e:
            print(f"Failed to retrieve content: {e}")
    return count
🔗 Set up the experiment¶
DPK requires that we define a source/input folder from which the transform sequence will ingest the documents and a destination/output folder where the tokenized output will be stored. We will also initialize the list of articles we want to use in our application.
import os
import tempfile
datafolder = tempfile.mkdtemp(dir=os.getcwd())
articles = ["Science,_technology,_engineering,_and_mathematics"]
assert load_corpus(articles, datafolder) > 0, "Failed to download any documents"
🔗 Ingest¶
Invoke the Docling2Parquet transform, which will parse the HTML documents and convert them to Markdown stored in a parquet file.
%%capture
from dpk_docling2parquet import Docling2Parquet, docling2parquet_contents_types
result = Docling2Parquet(
    input_folder=datafolder,
    output_folder=f"{datafolder}/docling2parquet",
    data_files_to_use=[".html"],
    docling2parquet_contents_type=docling2parquet_contents_types.MARKDOWN,  # markdown
).transform()
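Before moving on to chunking, it can be useful to peek at what the ingest stage produced. The snippet below is an optional sanity check, not part of the original pipeline; it simply reads back whatever parquet files the transform wrote and prints their schema:
from pathlib import Path
import pandas as pd
# read back the parquet file(s) written by Docling2Parquet
ingested = pd.concat(
    pd.read_parquet(f) for f in Path(f"{datafolder}/docling2parquet").glob("*.parquet")
)
print(ingested.columns.tolist())  # columns produced by the ingest stage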
🔗 Chunk¶
Invoke the DocChunk transform to break the Markdown document into chunks.
%%capture
from dpk_doc_chunk import DocChunk
result = DocChunk(
    input_folder=f"{datafolder}/docling2parquet",
    output_folder=f"{datafolder}/doc_chunk",
    doc_chunk_chunking_type="li_markdown",
    doc_chunk_chunk_size_tokens=128,  # default 128
    doc_chunk_chunk_overlap_tokens=30,  # default 30
).transform()
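Similarly, we can run a quick optional check on the chunking stage. The sketch below just counts how many chunks were produced, without assuming a particular output schema:
from pathlib import Path
import pandas as pd
# concatenate all chunk parquet files and report the chunk count
chunks = pd.concat(
    pd.read_parquet(f) for f in Path(f"{datafolder}/doc_chunk").glob("*.parquet")
)
print(f"{len(chunks)} chunks produced from {len(articles)} article(s)")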
🔗 Tokenization¶
Invoke the Tokenization transform to create token IDs for each chunk.
%%capture
from dpk_tokenization import Tokenization
Tokenization(
    input_folder=f"{datafolder}/doc_chunk",
    output_folder=f"{datafolder}/tkn",
    tkn_tokenizer="hf-internal-testing/llama-tokenizer",
    tkn_chunk_size=20_000,
).transform()
✅ Summary¶
This notebook demonstrated how to run a DPK pipeline using IBM's Data Prep Kit and the Docling library. Each transform creates one or more parquet files that users can explore to better understand what each stage of the pipeline produces. To see the output of the final stage, we will use Pandas to read the final parquet files and display their contents.
from pathlib import Path
import pandas as pd
parquet_files = list(Path(f"{datafolder}/tkn/").glob("*.parquet"))
pd.concat(pd.read_parquet(file) for file in parquet_files)
|   | tokens | document_id | document_length | token_count |
|---|---|---|---|---|
| 0 | [1, 444, 11814, 262, 3002] | f1f5b56a78829ab2165b3bbeb94b1167e4c5583c437f1d... | 14 | 5 |
| 1 | [1, 835, 5298, 13, 13, 797, 278, 4688, 29871, ... | 402e82a9e81cc3d2494fac36bebf8bf1a2662800e5a00c... | 2100 | 655 |
| 2 | [1, 835, 5901, 21833, 13, 13, 29899, 321, 1254... | 4fb389d0f0e999c2496f137b4a7c0671e79c09cf9477e9... | 2833 | 968 |
| 3 | [1, 444, 26304, 4978, 13, 13, 14136, 1967, 666... | 3709997548d84224361a6835760b5ae48a1637e78d54a0... | 1496 | 483 |
| 4 | [1, 444, 2648, 4234] | 1e1a58ad5664d963bc207dc791825258c33337c2559f6a... | 13 | 4 |
| 5 | [1, 835, 8314, 13, 13, 1576, 9870, 315, 1038, ... | 83a63864e5ddfdd41ef0f813fb7aa3c95e04c029c32ab3... | 1340 | 442 |
| 6 | [1, 835, 7400, 13, 13, 6028, 1114, 27871, 2987... | 5e29fb4e4cf37ed4c49994620e4a00da9693bc061e82c1... | 1800 | 548 |
| 7 | [1, 835, 7551, 13, 13, 25411, 3762, 8950, 6020... | 3fc34013d93391a7504e84069190479fbc85ba7e7072cb... | 1784 | 511 |
| 8 | [1, 835, 4092, 13, 13, 13393, 884, 29901, 518,... | e8b28e20e3fc3da40b6b368e30f9c953f5218370ec2f7a... | 774 | 229 |
| 9 | [1, 3191, 18312, 13, 13, 1576, 365, 29965, 152... | 94b54fbda274536622f70442b18126f554610e8915b235... | 1076 | 263 |
| 10 | [1, 3191, 3444, 13, 13, 1576, 1024, 310, 317, ... | fef9b66567944df131851834e2fdfb42b5c668e4b08031... | 238 | 60 |
| 11 | [1, 835, 12798, 12026, 13, 13, 1254, 12665, 97... | eeb74ae3490539aa07f25987b6b2666dc907b39147e810... | 366 | 97 |
| 12 | [1, 835, 7513, 13, 13, 19302, 284, 2879, 515, ... | cc2ccd2e9f4d0a8224716109f7a6e7b30f33ff1f8c7adf... | 1395 | 402 |
| 13 | [1, 835, 20537, 423, 13, 13, 797, 20537, 423, ... | baf13788a018da24d86b630a9032eaeee54913bbbdd0d4... | 511 | 137 |
| 14 | [1, 835, 21215, 13, 13, 1254, 12665, 17800, 52... | a5b3973ab3a98d10f4ae07a004d70c6cdcfacb41fda8d7... | 1949 | 536 |
| 15 | [1, 835, 26260, 13, 13, 797, 278, 518, 4819, 2... | dfa35b16704a4dd549701a7821b6aa856f2dd5e5b69daf... | 1042 | 291 |
| 16 | [1, 835, 660, 14873, 13, 13, 797, 518, 29984, ... | a0809b265e4a011407d38cd06c7b3ce5932683a2f9c6af... | 852 | 282 |
| 17 | [1, 835, 25960, 13, 13, 1254, 12665, 338, 760,... | 85e8f3b2af3268d49e60451d3ac87b3bd281a70cf6c4b7... | 1165 | 285 |
| 18 | [1, 835, 498, 26517, 13, 13, 797, 29871, 29906... | 15c924efdbf0135de91a095237cbe831275bab67ee1371... | 1612 | 397 |
| 19 | [1, 835, 26459, 13, 13, 29911, 29641, 728, 317... | b473b50753dd07f08da05bbf776c57747ab85ba79cb081... | 435 | 145 |
| 20 | [1, 835, 3303, 3900, 13, 13, 797, 278, 3303, 3... | 841cefc910bd5d1920187b23554ee67e0e65563373e6de... | 1212 | 344 |
| 21 | [1, 3191, 3086, 9327, 10606, 13, 13, 14804, 25... | 63924939eab38ad6636495f1c5c13760014efe42b330a6... | 1592 | 416 |
| 22 | [1, 3191, 1954, 29885, 16783, 8898, 13, 13, 24... | 44288e766c343592a44f3da59ad3b57a9f26096ac13412... | 1653 | 465 |
| 23 | [1, 3191, 13151, 13, 13, 13393, 884, 29901, 51... | 40a0f6e213901d92f1a158c3e2a55ad2558eb1deaa973f... | 4418 | 1285 |
| 24 | [1, 3191, 6981, 1455, 17261, 297, 317, 4330, 2... | 5cc92a05d39ee56e9c65cdb00f55bc9dcbe8bc1647a442... | 1289 | 375 |
| 25 | [1, 3191, 402, 1581, 330, 2547, 297, 317, 4330... | 37c88bed7898d9a7406b5b0e4b1ccfaca65a732dff0c03... | 821 | 280 |
| 26 | [1, 3191, 4124, 2042, 2877, 297, 317, 4330, 29... | f144b97af462b2ab8aba5cb6d9cba0cf5f383cc710aba0... | 1093 | 297 |
| 27 | [1, 3191, 3082, 24620, 277, 20193, 512, 4812, ... | 16525e2054a7bb7543308ad4e6642bf60e66dc475a0e0a... | 2203 | 538 |
| 28 | [1, 3191, 317, 4330, 29924, 13151, 3189, 284, ... | ebb319391e1bda81edd5ec214887150044c15cfc04f42f... | 514 | 149 |
| 29 | [1, 3191, 2522, 449, 292, 13, 13, 797, 29871, ... | 882582d1f6202a4e495f67952d3a27929177745b1f575e... | 850 | 261 |
| 30 | [1, 3191, 10317, 310, 5282, 1947, 11104, 13, 1... | 311aa5c91354b6bf575682be701981ccc6569eb35fd726... | 1561 | 416 |
| 31 | [1, 3191, 24206, 13, 13, 1254, 12665, 23992, 2... | abaa73aba997ea267d9b556679c5d680810ee5baa231fa... | 384 | 139 |
| 32 | [1, 3191, 18991, 362, 13, 13, 1576, 518, 29048... | 00f85d6dffd914d89eb44dbb4caa3a1c6b2af47f5c4c96... | 878 | 247 |
| 33 | [1, 3191, 17163, 29879, 13, 13, 797, 3979, 298... | f8d901fca6dcac6c266cf2799da814c5f5b5644c3b9476... | 2321 | 682 |
| 34 | [1, 3191, 3599, 296, 6728, 13, 13, 7504, 3278,... | 8347c4988e3acde4723696fbf63a0f2c13d61e92c8fbac... | 2960 | 841 |
| 35 | [1, 3191, 28488, 322, 11104, 304, 1371, 2693, ... | c3d0c80c861ffcd422f60b78d693bb953b69dfc3c3d55f... | 222 | 81 |
| 36 | [1, 835, 18444, 13, 13, 797, 18444, 29892, 676... | 9c41677100393c4e5e3bc4bc36caee5561cb5c93546aaf... | 1143 | 288 |
| 37 | [1, 444, 10152, 13, 13, 6330, 7456, 29901, 518... | 83f0f668bac5736d5f23f750f86ebbe173c0a56e3c51b8... | 2777 | 833 |
| 38 | [1, 444, 365, 7210, 29911, 29984, 29974, 13, 1... | 24bbfff971979686cd41132b491060bdaaf357bd3bc7cf... | 2579 | 847 |
| 39 | [1, 444, 15976, 293, 1608, 13, 13, 1576, 8569,... | 1b8c147d642e4d53152e1be73223ed58e0788700d82c73... | 4700 | 1299 |
| 40 | [1, 444, 2823, 884, 13, 13, 29899, 518, 29907,... | ac3fb4073323718ea3e32e006ed67c298af9801c4a03dd... | 1310 | 443 |
| 41 | [1, 444, 28318, 13, 13, 29896, 29889, 518, 298... | 2dad03b0e2b81c47012f94be0ab730e9c8341f0311c59e... | 59373 | 26470 |
| 42 | [1, 444, 8725, 5183, 13, 13, 29899, 4699, 1522... | 07dabd1b5cfa6f8c70f97eb33c3a19189a866eae1203c7... | 2648 | 1075 |
| 43 | [1, 444, 3985, 2988, 13, 13, 29899, 8213, 4475... | ef8cc66ae18d7238680d07372859c5be061d57b955cf7d... | 5025 | 705 |