RAG with OpenSearch¶
| Step | Tech | Execution |
|---|---|---|
| Embedding | Ollama (IBM Granite Embedding 30M) | 💻 Local |
| Vector store | OpenSearch 2.19.3 | 💻 Local |
| Gen AI | Ollama (IBM Granite 3.3 8B) | 💻 Local |
This is a code recipe that uses OpenSearch, an open-source search and analytics tool, and the LlamaIndex framework to perform RAG over documents parsed by Docling.
In this notebook, we accomplish the following:
- 📚 Parse documents using Docling's document conversion capabilities
- 🧩 Perform hierarchical chunking of the documents using Docling
- 🔢 Generate text embeddings on document chunks
- 🤖 Perform RAG using OpenSearch and the LlamaIndex framework
- 🛠️ Leverage the transformation and structure capabilities of Docling documents for RAG
Preparation¶
Running the notebook¶
To run this notebook on your machine, you can use applications like Jupyter Notebook or Visual Studio Code.
💡 For best results, please use GPU acceleration to run this notebook.
Virtual environment¶
To avoid conflicts in your environment, it is advisable to create a virtual environment (venv) before installing dependencies. For instance, uv is a popular tool for managing virtual environments and dependencies. You can install it with:
curl -LsSf https://astral.sh/uv/install.sh | sh
Then create the virtual environment and activate it:
uv venv
source .venv/bin/activate
Refer to Installing uv for more details.
Dependencies¶
To start, install the required dependencies by running the following command:
! uv pip install -q --no-progress notebook ipywidgets docling llama-index-readers-file llama-index-readers-docling llama-index-node-parser-docling llama-index-vector-stores-opensearch llama-index-embeddings-ollama llama-index-llms-ollama
We now import all the necessary modules for this notebook:
import logging
from pathlib import Path
from tempfile import mkdtemp
import requests
import torch
from docling_core.transforms.chunker import HierarchicalChunker
from docling_core.transforms.chunker.hierarchical_chunker import (
ChunkingDocSerializer,
ChunkingSerializerProvider,
)
from docling_core.transforms.serializer.markdown import MarkdownTableSerializer
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.core.schema import TransformComponent
from llama_index.core.vector_stores import MetadataFilter, MetadataFilters
from llama_index.core.vector_stores.types import VectorStoreQueryMode
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.node_parser.docling import DoclingNodeParser
from llama_index.readers.docling import DoclingReader
from llama_index.vector_stores.opensearch import (
OpensearchVectorClient,
OpensearchVectorStore,
)
from rich.console import Console
from rich.pretty import pprint
logging.getLogger().setLevel(logging.WARNING)
GPU Checking¶
Part of what makes Docling so remarkable is that it can run on commodity hardware, which means this notebook can be run on a local machine with GPU acceleration. If you're using a MacBook with an Apple Silicon chip, Docling integrates seamlessly with Metal Performance Shaders (MPS). MPS provides out-of-the-box GPU acceleration on macOS, integrates with PyTorch and TensorFlow, offers energy-efficient performance on Apple Silicon, and is compatible with all Metal-supported GPUs.
The code below checks if a GPU is available, either via CUDA or MPS.
# Check if GPU or MPS is available
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"CUDA GPU is enabled: {torch.cuda.get_device_name(0)}")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    print("MPS GPU is enabled.")
else:
    raise OSError(
        "No GPU or MPS device found. Please check your environment and ensure GPU or MPS support is configured."
    )
MPS GPU is enabled.
Local OpenSearch instance¶
To run the notebook locally, we can pull an OpenSearch image and run a single node for local development. You can use a container tool like Podman or Docker. In the interest of simplicity, we disable the SSL option for this example.
💡 The version of the OpenSearch instance needs to be compatible with the version of the OpenSearch Python Client library, since this library is used by the LlamaIndex framework, which we leverage in this notebook.
In your terminal, run:
podman run \
-it \
--pull always \
-p 9200:9200 \
-p 9600:9600 \
-e "discovery.type=single-node" \
-e DISABLE_INSTALL_DEMO_CONFIG=true \
-e DISABLE_SECURITY_PLUGIN=true \
--name opensearch-node \
-d opensearchproject/opensearch:2.19.3
Once the instance is running, verify that you can connect to OpenSearch:
response = requests.get("http://localhost:9200")
print(response.text)
{ "name" : "b8582205a25c", "cluster_name" : "docker-cluster", "cluster_uuid" : "VxJ5hoxDRn68jodknsNdag", "version" : { "distribution" : "opensearch", "number" : "2.19.3", "build_type" : "tar", "build_hash" : "a90f864b8524bc75570a8461ccb569d2a4bfed42", "build_date" : "2025-07-21T22:34:54.259463448Z", "build_snapshot" : false, "lucene_version" : "9.12.2", "minimum_wire_compatibility_version" : "7.10.0", "minimum_index_compatibility_version" : "7.0.0" }, "tagline" : "The OpenSearch Project: https://opensearch.org/" }
Ollama models¶
We will use Ollama, an open-source tool to run language models on your local computer, rather than relying on cloud services.
In this example, we will use:
- IBM Granite Embedding 30M English for text embeddings
- IBM Granite 3.3 8B Instruct for model inference
Once Ollama is installed on your computer, you can pull and run the models above from your terminal:
ollama run granite-embedding:30m
ollama run granite3.3:8b
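As a quick sanity check, you can list the models currently served by Ollama; this sketch assumes Ollama's default local API endpoint on port 11434:
import requests
# list the models available in the local Ollama instance
models = requests.get("http://localhost:11434/api/tags").json()
print([m["name"] for m in models.get("models", [])])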
Setup¶
We set up the main variables for OpenSearch and for the embedding and generation models.
# http endpoint for your cluster
OPENSEARCH_ENDPOINT = "http://localhost:9200"
# index to store the Docling document vectors
OPENSEARCH_INDEX = "docling-index"
# the embedding model
EMBED_MODEL = OllamaEmbedding(model_name="granite-embedding:30m")
# the generation model
GEN_MODEL = Ollama(
model="granite3.3:8b",
request_timeout=120.0,
# Manually set the context window to limit memory usage
context_window=8000,
# Set temperature to 0 for reproducibility of the results
temperature=0.0,
)
# a sample document
SOURCE = "https://arxiv.org/pdf/2408.09869"
# a sample query
QUERY = "Which are the main AI models in Docling?"
embed_dim = len(EMBED_MODEL.get_text_embedding("hi"))
print(f"The embedding dimension is {embed_dim}.")
The embedding dimension is 384.
Process Data Using Docling¶
Docling can parse various document formats into a unified representation (DoclingDocument), which can then be exported to different output formats. For a full list of supported input and output formats, please refer to the Supported formats section of Docling's documentation.
In this recipe, we will use a single PDF file, the Docling Technical Report. We will process it using a Hierarchical Chunker provided by Docling to generate structured, hierarchical chunks suitable for downstream RAG tasks.
💡 The Hybrid Chunker is an alternative with additional capabilities for efficient segmentation of the document. Check the Hybrid Chunking example for more details.
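For intuition on what the Docling reader and node parser do under the hood, here is a minimal, optional sketch that converts the same PDF and chunks it directly with Docling, outside of LlamaIndex (it re-runs the document conversion, so it may take a moment):
from itertools import islice
from docling.document_converter import DocumentConverter
from docling_core.transforms.chunker import HierarchicalChunker
# convert the PDF into a DoclingDocument and preview the first few hierarchical chunks
dl_doc = DocumentConverter().convert(SOURCE).document
for chunk in islice(HierarchicalChunker().chunk(dl_doc), 3):
    print(f"{chunk.text[:80]}…")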
tmp_dir_path = Path(mkdtemp())
req = requests.get(SOURCE)
with open(tmp_dir_path / f"{Path(SOURCE).name}.pdf", "wb") as out_file:
    out_file.write(req.content)
# create a Docling reader and a node parser with default Hierarchical chunker
reader = DoclingReader(export_type=DoclingReader.ExportType.JSON)
dir_reader = SimpleDirectoryReader(
input_dir=tmp_dir_path,
file_extractor={".pdf": reader},
)
# load the PDF files
documents = dir_reader.load_data()
/Users/ceb/git/docling/.venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py:684: UserWarning: 'pin_memory' argument is set as true but not supported on MPS now, then device pinned memory won't be used. warnings.warn(warn_msg)
Load Data into OpenSearch¶
Define the Transformations¶
Before the actual ingestion of data, we need to define the data transformations to apply on the DoclingDocument:
- DoclingNodeParser executes the document-based chunking
- MetadataTransform is a custom transformation to ensure that generated chunk metadata is best formatted for indexing with OpenSearch
# create a Docling node parser
node_parser = DoclingNodeParser()
# create a custom transformation to avoid out-of-range integers
class MetadataTransform(TransformComponent):
    def __call__(self, nodes, **kwargs):
        for node in nodes:
            binary_hash = node.metadata.get("origin", {}).get("binary_hash", None)
            if binary_hash is not None:
                node.metadata["origin"]["binary_hash"] = str(binary_hash)
        return nodes
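The conversion of binary_hash to a string is what avoids the out-of-range integers mentioned above: Docling stores the hash as an unsigned 64-bit integer, which can exceed the range of OpenSearch's signed long field type. A minimal check of the transform on a dummy node (the hash value below is made up for illustration):
from llama_index.core.schema import TextNode
# a node whose binary_hash would overflow a signed 64-bit integer
sample = TextNode(text="demo", metadata={"origin": {"binary_hash": 2**63 + 1}})
fixed = MetadataTransform()(nodes=[sample])
print(type(fixed[0].metadata["origin"]["binary_hash"]))  # <class 'str'>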
Embed and Insert the Data¶
In this step, we create an OpensearchVectorClient, which encapsulates the logic for a single OpenSearch index with vector search enabled.
We then initialize the index using our sample data (a single PDF file), the Docling node parser, and the OpenSearch client that we just created.
# OpensearchVectorClient stores text in this field by default
text_field = "content"
# OpensearchVectorClient stores embeddings in this field by default
embed_field = "embedding"
client = OpensearchVectorClient(
    endpoint=OPENSEARCH_ENDPOINT,
    index=OPENSEARCH_INDEX,
    dim=embed_dim,
    embedding_field=embed_field,
    text_field=text_field,
)
vector_store = OpensearchVectorStore(client)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
documents=documents,
transformations=[node_parser, MetadataTransform()],
storage_context=storage_context,
embed_model=EMBED_MODEL,
)
2025-09-10 13:16:53,752 - WARNING - GET http://localhost:9200/docling-index [status:404 request:0.015s]
console = Console(width=88)
QUERY = "Which are the main AI models in Docling?"
query_engine = index.as_query_engine(llm=GEN_MODEL)
res = query_engine.query(QUERY)
console.print(f"👤: {QUERY}\n🤖: {res.response.strip()}")
👤: Which are the main AI models in Docling?
🤖: Docling primarily utilizes two AI models. The first one is a layout analysis model, serving as an accurate object-detector for page elements. The second model is TableFormer, a state-of-the-art table structure recognition model. Both models are pre-trained and their weights are hosted on Hugging Face. They also power the deepsearch-experience, a cloud-native service for knowledge exploration tasks.
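Besides the generated answer, the query engine response also exposes the retrieved chunks, which is handy for spot-checking how the answer is grounded. A minimal sketch:
# inspect the chunks that were retrieved to ground the answer
for src in res.source_nodes:
    console.print(f"score={src.score} | {src.text[:100]}…")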
Custom serializers¶
Docling can extract the table content and process it for chunking, like other text elements.
In the following example, the response is generated from a retrieved chunk containing a table.
QUERY = (
"What are the performance metrics of Docling-native PDF backend with 16 threads?"
)
query_engine = index.as_query_engine(llm=GEN_MODEL)
res = query_engine.query(QUERY)
console.print(f"👤: {QUERY}\n🤖: {res.response.strip()}")
👤: What are the performance metrics of Docling-native PDF backend with 16 threads?
🤖: The Docling-native PDF backend, when utilized with 16 threads on an Apple M3 Max system, completed the processing in approximately 167 seconds. It achieved a throughput of about 1.34 pages per second and peaked at a memory usage of 6.20 GB (resident set size). On an Intel Xeon E5-2690 system with the same thread count, it took around 244 seconds to process, managed a throughput of 0.92 pages per second, and reached a peak memory usage of 6.16 GB.
The result above was generated with the table serialized in a triplet format. Language models may perform better on complex tables if the structure is represented in a widely adopted format, such as Markdown.
For this purpose, we can leverage a custom serializer that renders tables in Markdown format:
class MDTableSerializerProvider(ChunkingSerializerProvider):
    def get_serializer(self, doc):
        return ChunkingDocSerializer(
            doc=doc,
            # configuring a different table serializer
            table_serializer=MarkdownTableSerializer(),
        )
# clear the database from the previous chunks
client.clear()
vector_store.clear()
chunker = HierarchicalChunker(
serializer_provider=MDTableSerializerProvider(),
)
node_parser = DoclingNodeParser(chunker=chunker)
index = VectorStoreIndex.from_documents(
documents=documents,
transformations=[node_parser, MetadataTransform()],
storage_context=storage_context,
embed_model=EMBED_MODEL,
)
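Optionally, you can peek at a chunk that contains a table to confirm that its content is now serialized as Markdown rather than as triplets; this sketch simply reuses the doc_items metadata attached by the node parser:
# find the first chunk whose doc_items include a table and preview its text
for node in node_parser.get_nodes_from_documents(documents):
    if any(item["label"] == "table" for item in node.metadata.get("doc_items", [])):
        print(node.text[:300])
        break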
Let's run another table-related query and observe that the generated response is now more accurate.
query_engine = index.as_query_engine(llm=GEN_MODEL)
QUERY = "Which backend is faster on Intel with 4 threads?"
res = query_engine.query(QUERY)
console.print(f"👤: {QUERY}\n🤖: {res.response.strip()}")
👤: Which backend is faster on Intel with 4 threads?
🤖: The pypdfium backend is faster than the Docling-native PDF backend for an Intel Xeon E5-2690 CPU with a thread budget of 4, as indicated in Table 1. The pypdfium backend completes the processing in 239 seconds, achieving a throughput of 0.94 pages per second, while the Docling-native PDF backend takes 375 seconds.
Refer to the Advanced chunking & serialization example for more details on serialization strategies.
Filter-context Query¶
By default, the DoclingNodeParser will keep the hierarchical information of items when creating the chunks. That information will be stored as metadata in the OpenSearch index. Leveraging the document structure is a powerful feature of Docling for improving RAG systems, both for retrieval and for answer generation.
For example, we can use chunk metadata with layout information to run queries in a filter context, for high retrieval accuracy.
Using the previous setup, we can see that the most similar chunk corresponds to a paragraph without enough grounding for the question:
def display_nodes(nodes):
    res = []
    for idx, item in enumerate(nodes):
        doc_res = {"k": idx + 1, "score": item.score, "text": item.text, "items": []}
        doc_items = item.metadata["doc_items"]
        for doc in doc_items:
            doc_res["items"].append({"ref": doc["self_ref"], "label": doc["label"]})
        res.append(doc_res)
    pprint(res, max_string=200)
retriever = index.as_retriever(similarity_top_k=1)
QUERY = "How does pypdfium perform?"
nodes = retriever.retrieve(QUERY)
print(QUERY)
display_nodes(nodes)
How does pypdfium perform?
[
│   {
│   │   'k': 1,
│   │   'score': 0.6800267,
│   │   'text': 'If you need to run Docling in very low-resource environments, please consider configuring the pypdfium backend. While it is faster and more memory efficient than the default docling-parse backend, it '+90,
│   │   'items': [{'ref': '#/texts/68', 'label': 'text'}]
│   }
]
We may want to restrict the retrieval to only those chunks containing tabular data, expecting to retrieve more quantitative information for our type of question:
filters = MetadataFilters(
filters=[MetadataFilter(key="doc_items.label", value="table")]
)
table_retriever = index.as_retriever(filters=filters, similarity_top_k=1)
nodes = table_retriever.retrieve(QUERY)
print(QUERY)
display_nodes(nodes)
How does pypdfium perform?
[
│   {
│   │   'k': 1,
│   │   'score': 0.6078317,
│   │   'text': 'Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We show the time-to-solution (TT'+1014,
│   │   'items': [{'ref': '#/texts/72', 'label': 'caption'}, {'ref': '#/tables/0', 'label': 'table'}]
│   }
]
Hybrid Search Retrieval with RRF¶
Hybrid search combines keyword and semantic search to improve search relevance. Rather than relying on traditional score normalization techniques, hybrid search can use reciprocal rank fusion (RRF), which can significantly improve the relevance of the chunks retrieved by our RAG system.
First, create a search pipeline and specify RRF as the combination technique:
url = f"{OPENSEARCH_ENDPOINT}/_search/pipeline/rrf-pipeline"
headers = {"Content-Type": "application/json"}
body = {
"description": "Post processor for hybrid RRF search",
"phase_results_processors": [
{"score-ranker-processor": {"combination": {"technique": "rrf"}}}
],
}
response = requests.put(url, json=body, headers=headers)
print(response.text)
{"acknowledged":true}
We can then repeat the previous steps to get a VectorStoreIndex object, leveraging the search pipeline that we just created:
client_rrf = OpensearchVectorClient(
endpoint=OPENSEARCH_ENDPOINT,
index=f"{OPENSEARCH_INDEX}-rrf",
dim=embed_dim,
embedding_field=embed_field,
text_field=text_field,
search_pipeline="rrf-pipeline",
)
vector_store_rrf = OpensearchVectorStore(client_rrf)
storage_context_rrf = StorageContext.from_defaults(vector_store=vector_store_rrf)
index_hybrid = VectorStoreIndex.from_documents(
documents=documents,
transformations=[node_parser, MetadataTransform()],
storage_context=storage_context_rrf,
embed_model=EMBED_MODEL,
)
2025-09-10 13:17:10,104 - WARNING - GET http://localhost:9200/docling-index-rrf [status:404 request:0.001s]
The first retriever, which relies entirely on semantic (vector) search, fails to place the supporting chunk for the given question in the top-1 position. Note that we highlight a few expected keywords for illustration purposes.
QUERY = "Does Docling project provide a Dockerfile?"
retriever = index.as_retriever(similarity_top_k=3)
nodes = retriever.retrieve(QUERY)
exp = "Docling also provides a Dockerfile"
start = "[bold yellow]"
end = "[/]"
for idx, item in enumerate(nodes):
    console.print(
        f"*** k={idx + 1} ***\n{item.text.strip().replace(exp, f'{start}{exp}{end}')}"
    )
*** k=1 ***
We encourage everyone to propose or implement additional features and models, and will gladly take your inputs and contributions under review . The codebase of Docling is open for use and contribution, under the MIT license agreement and in alignment with our contributing guidelines included in the Docling repository. If you use Docling in your projects, please consider citing this technical report.
*** k=2 ***
Optionally, you can configure custom pipeline features and runtime options, such as turning on or off features (e.g. OCR, table structure recognition), enforcing limits on the input document size, and defining the budget of CPU threads. Advanced usage examples and options are documented in the README file. Docling also provides a Dockerfile to demonstrate how to install and run it inside a container.
*** k=3 ***
Docling is designed to allow easy extension of the model library and pipelines. In the future, we plan to extend Docling with several more models, such as a figure-classifier model, an equationrecognition model, a code-recognition model and more. This will help improve the quality of conversion for specific types of content, as well as augment extracted document metadata with additional information. Further investment into testing and optimizing GPU acceleration as well as improving the Docling-native PDF backend are on our roadmap, too.
However, the retriever with the hybrid search pipeline effectively recognizes the key paragraph in the first position:
retriever_rrf = index_hybrid.as_retriever(
vector_store_query_mode=VectorStoreQueryMode.HYBRID, similarity_top_k=3
)
nodes = retriever_rrf.retrieve(QUERY)
for idx, item in enumerate(nodes):
    console.print(
        f"*** k={idx + 1} ***\n{item.text.strip().replace(exp, f'{start}{exp}{end}')}"
    )
*** k=1 ***
Optionally, you can configure custom pipeline features and runtime options, such as turning on or off features (e.g. OCR, table structure recognition), enforcing limits on the input document size, and defining the budget of CPU threads. Advanced usage examples and options are documented in the README file. Docling also provides a Dockerfile to demonstrate how to install and run it inside a container.
*** k=2 ***
We therefore decided to provide multiple backend choices, and additionally open-source a custombuilt PDF parser, which is based on the low-level qpdf [4] library. It is made available in a separate package named docling-parse and powers the default PDF backend in Docling. As an alternative, we provide a PDF backend relying on pypdfium , which may be a safe backup choice in certain cases, e.g. if issues are seen with particular font encodings.
*** k=3 ***
We encourage everyone to propose or implement additional features and models, and will gladly take your inputs and contributions under review . The codebase of Docling is open for use and contribution, under the MIT license agreement and in alignment with our contributing guidelines included in the Docling repository. If you use Docling in your projects, please consider citing this technical report.