RAG with MongoDB + VoyageAI¶
Step | Tech | Execution |
---|---|---|
Embedding | Voyage AI | 🌐 Remote |
Vector store | MongoDB | 🌐 Remote |
Gen AI | Azure OpenAI | 🌐 Remote
How to cook¶
This notebook demonstrates how to build a Retrieval-Augmented Generation (RAG) pipeline using MongoDB as a vector store and Voyage AI embedding models for semantic search. The workflow involves extracting and chunking text from documents, generating embeddings with Voyage AI, storing vectors in MongoDB, and leveraging Azure OpenAI for generative responses.
- MongoDB Vector Search: MongoDB supports storing and searching high-dimensional vectors, enabling efficient similarity search for RAG applications. Learn more: MongoDB Vector Search
- Voyage AI Embeddings: Voyage AI provides state-of-the-art embedding models for text, supporting robust semantic search and retrieval. See: Voyage AI Documentation
- Azure OpenAI models: Azure OpenAI's models are used to generate answers based on the retrieved context. More info: Azure OpenAI API
By combining these technologies, you can build scalable, production-ready RAG systems for advanced document understanding and question answering.
Setting Up Your Environment¶
First, we'll install the necessary libraries and configure our environment. These packages enable document processing, database connections, embedding generation, and AI model interaction. We're using Docling for document handling, PyMongo for MongoDB integration, VoyageAI for embeddings, and OpenAI client for generation capabilities.
%%capture
%pip install "docling~=2.7.0"
%pip install "pymongo[srv]"
%pip install voyageai
%pip install openai
import logging
import warnings
warnings.filterwarnings("ignore")
logging.getLogger("pymongo").setLevel(logging.ERROR)
Part 1: Setting up Docling¶
Part of what makes Docling so remarkable is that it can run on commodity hardware, so this notebook can be run on a local machine with GPU acceleration. If you're using a Mac with Apple Silicon, Docling integrates seamlessly with Metal Performance Shaders (MPS). MPS provides out-of-the-box GPU acceleration on macOS, integrates with PyTorch and TensorFlow, delivers energy-efficient performance on Apple Silicon, and works with all Metal-supported GPUs.
The code below checks to see if a GPU is available, either via CUDA or MPS.
import torch
# Check if GPU or MPS is available
if torch.cuda.is_available():
device = torch.device("cuda")
print(f"CUDA GPU is enabled: {torch.cuda.get_device_name(0)}")
elif torch.backends.mps.is_available():
device = torch.device("mps")
print("MPS GPU is enabled.")
else:
raise OSError(
"No GPU or MPS device found. Please check your environment and ensure GPU or MPS support is configured."
)
MPS GPU is enabled.
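Docling picks up an available accelerator automatically, but it can also be set explicitly. The sketch below is optional and assumes a Docling release that exposes AcceleratorOptions, which may be newer than the 2.7 pin used above:
# Optional sketch: explicitly select the accelerator for Docling's PDF pipeline.
# Assumes a Docling version that provides AcceleratorOptions.
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    AcceleratorDevice,
    AcceleratorOptions,
    PdfPipelineOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
pipeline_options.accelerator_options = AcceleratorOptions(
    num_threads=8,
    device=AcceleratorDevice.MPS
    if torch.backends.mps.is_available()
    else AcceleratorDevice.AUTO,
)
accelerated_converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)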
Single-Document RAG Baseline¶
To begin, we will focus on a single seminal paper and treat it as the entire knowledge base. Building a Retrieval-Augmented Generation (RAG) pipeline on just one document serves as a clear, controlled baseline before scaling to multiple sources. This helps validate each stage of the workflow (parsing, chunking, embedding, retrieval, generation) without confounding factors introduced by inter-document noise.
# Influential machine learning papers
source_urls = [
"https://arxiv.org/pdf/1706.03762" # Attention is All You Need
]
Convert the Source Document¶
Convert the source URL into a Docling document with DocumentConverter. Since we are working with a single paper, we call convert() on the first URL; when scaling to multiple sources, convert_all() can process a list of URLs, and caching the results in a dict keyed by URL avoids redundant downloads and parsing (a sketch of this follows the code below).
from pprint import pprint
from docling.document_converter import DocumentConverter
# Instantiate the doc converter
doc_converter = DocumentConverter()
# Since we want to use a single document, we will convert just the first URL. For multiple documents, you can use convert_all() method and then iterate through the list of converted documents.
pdf_doc = source_urls[0]
converted_doc = doc_converter.convert(pdf_doc).document
Fetching 9 files: 100%|██████████| 9/9 [00:00<00:00, 73728.00it/s]
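For reference, here is a minimal sketch of the multi-document variant described above, assuming convert_all() yields results in input order; it caches converted documents by URL so nothing is downloaded or parsed twice:
# Sketch: convert every source URL once, keeping a URL -> DoclingDocument mapping
# so already-converted documents are not downloaded and parsed again.
converted_docs = {}
pending_urls = [url for url in source_urls if url not in converted_docs]
for url, conv_result in zip(pending_urls, doc_converter.convert_all(pending_urls)):
    converted_docs[url] = conv_result.document
    # conv_result.document.export_to_markdown() returns Markdown text if you need it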
Post-process extracted document data¶
We use Docling's HierarchicalChunker()
to perform hierarchy-aware chunking of the converted document. This preserves some of the structure and relationships within the document, enabling more accurate and relevant retrieval in our RAG pipeline.
from docling_core.transforms.chunker import HierarchicalChunker
# Initialize the chunker
chunker = HierarchicalChunker()
# Perform hierarchical chunking on the converted document and get text from chunks
chunks = list(chunker.chunk(converted_doc))
chunk_texts = [chunk.text for chunk in chunks]
chunk_texts[:20] # Display a few chunk texts
['arXiv:1706.03762v7 [cs.CL] 2 Aug 2023', 'Provided proper attribution is provided, Google hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works.', 'Ashish Vaswani ∗ Google Brain avaswani@google.com', 'Noam Shazeer ∗ Google Brain noam@google.com', 'Niki Parmar ∗ Google Research nikip@google.com', 'Jakob Uszkoreit ∗ Google Research usz@google.com', 'Llion Jones ∗ Google Research llion@google.com', 'Aidan N. Gomez ∗ † University of Toronto aidan@cs.toronto.edu', 'Łukasz Kaiser ∗ Google Brain lukaszkaiser@google.com', 'Illia Polosukhin ∗ ‡', 'illia.polosukhin@gmail.com', 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 Englishto-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.', '$^{∗}$Equal contribution. Listing order is random. Jakob proposed replacing RNNs with self-attention and started the effort to evaluate this idea. Ashish, with Illia, designed and implemented the first Transformer models and has been crucially involved in every aspect of this work. Noam proposed scaled dot-product attention, multi-head attention and the parameter-free position representation and became the other person involved in nearly every detail. Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and tensor2tensor. Llion also experimented with novel model variants, was responsible for our initial codebase, and efficient inference and visualizations. Lukasz and Aidan spent countless long days designing various parts of and implementing tensor2tensor, replacing our earlier codebase, greatly improving results and massively accelerating our research.', '$^{†}$Work performed while at Google Brain.', '$^{‡}$Work performed while at Google Research.', '31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.', 'Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15].', 'Recurrent models typically factor computation along the symbol positions of the input and output sequences. 
Aligning the positions to steps in computation time, they generate a sequence of hidden states h$_{t}$ , as a function of the previous hidden state h$_{t}$$_{-}$$_{1}$ and the input for position t . This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains.', 'Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network.', 'In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.']
Part 2: Generating Embeddings with Voyage AI¶
We will use a Voyage AI embedding model to convert the chunks above into embeddings and then push them to MongoDB for retrieval.
Voyage AI offers a range of embedding models; for this use case we will use voyage-context-3
, a contextualized chunk embedding model in which each chunk embedding encodes not only the chunk's own content but also contextual information from the full document.
You can go through the blog post to understand how it performs in comparison to other embedding models.
Create an account on Voyage AI and get your API key.
import voyageai
# Voyage API key
VOYAGE_API_KEY = "**********************"
# Initialize the VoyageAI client
vo = voyageai.Client(VOYAGE_API_KEY)
result = vo.contextualized_embed(inputs=[chunk_texts], model="voyage-context-3")
contextualized_chunk_embds = [emb for r in result.results for emb in r.embeddings]
# Check lengths to ensure they match
print("Chunk Texts Length:", chunk_texts.__len__())
print("Contextualized Chunk Embeddings Length:", contextualized_chunk_embds.__len__())
Chunk Texts Length: 118
Contextualized Chunk Embeddings Length: 118
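Since the Atlas vector index we create later declares numDimensions=1024, a quick sanity check of the embedding dimensionality can save a confusing failure later (a minimal check, assuming the default voyage-context-3 output size):
# Sanity check: the embedding size must match the numDimensions declared in the
# Atlas vector index created later in this notebook (1024).
embedding_dim = len(contextualized_chunk_embds[0])
print("Embedding dimension:", embedding_dim)
assert embedding_dim == 1024, "Embedding size must match the vector index definition"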
# Combine chunks with their embeddings
chunk_data = [
{"text": text, "embedding": emb}
for text, emb in zip(chunk_texts, contextualized_chunk_embds)
]
Part 3: Inserting into MongoDB¶
With the generated embeddings prepared, we now insert them into MongoDB so they can be leveraged in the RAG pipeline.
MongoDB is an ideal vector store for RAG applications because:
- It supports efficient vector search capabilities through Atlas Vector Search
- It scales well for large document collections
- It offers flexible querying options for combining semantic and traditional search
- It provides robust indexing for fast retrieval
The chunks with their embeddings will be stored in a MongoDB collection, allowing us to perform similarity searches when responding to user queries.
# Insert to MongoDB
from pymongo import MongoClient
client = MongoClient(
"mongodb+srv://*******.mongodb.net/"
) # Replace with your MongoDB connection string
db = client["rag_db"] # Database name
collection = db["documents"] # Collection name
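# Note: re-running this cell will insert the same chunks again; clearing the collection
# first (for example with collection.delete_many({})) keeps re-runs idempotent.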
# Insert chunk data into MongoDB
response = collection.insert_many(chunk_data)
print(f"Inserted {len(response.inserted_ids)} documents into MongoDB.")
Inserted 118 documents into MongoDB.
Creating the Atlas Vector Search index¶
Using pymongo
we can create a vector search index that will help us search through our vectors and respond to user queries. This index is crucial for efficient similarity search between user questions and our document chunks. MongoDB Atlas Vector Search provides fast and accurate retrieval of semantically related content, which forms the foundation of our RAG pipeline.
from pymongo.operations import SearchIndexModel
# Create your index model, then create the search index
search_index_model = SearchIndexModel(
definition={
"fields": [
{
"type": "vector",
"path": "embedding",
"numDimensions": 1024,
"similarity": "dotProduct",
}
]
},
name="vector_index",
type="vectorSearch",
)
result = collection.create_search_index(model=search_index_model)
print("New search index named " + result + " is building.")
New search index named vector_index is building.
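Atlas builds the index asynchronously, so a query issued immediately after creation can return no results. Below is a small polling sketch, following the readiness-check pattern from the MongoDB Atlas Vector Search documentation, that waits until the index reports itself as queryable:
import time

# Poll until Atlas reports the new search index as queryable.
while True:
    indexes = list(collection.list_search_indexes("vector_index"))
    if indexes and indexes[0].get("queryable"):
        break
    print("Index is still building, waiting...")
    time.sleep(5)
print("Index 'vector_index' is ready for querying.")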
Query the vectorized data¶
To perform a query on the vectorized data stored in MongoDB, we can use the $vectorSearch
aggregation stage. This feature of MongoDB Atlas enables semantic search by finding documents based on vector similarity.
When executing a vector search query:
- MongoDB computes the similarity between the query vector and vectors stored in the collection
- The documents are ranked by their similarity score
- The top-N most similar results are returned
This enables us to find semantically related content rather than relying on exact keyword matches. The similarity metric we're using (dotProduct) is equivalent to cosine similarity here because Voyage AI embeddings are normalized to unit length, allowing us to identify content that is conceptually similar even if it uses different terminology.
For RAG applications, this vector search capability is crucial as it allows us to retrieve the most relevant context from our document collection based on the semantic meaning of a user's query, providing the foundation for generating accurate and contextually appropriate responses.
Part 4: Perform RAG on parsed articles¶
Here we perform RAG over the embedded data directly with MongoDB and Azure OpenAI, without a separate framework: MongoDB's $vectorSearch
aggregation stage retrieves the most relevant chunks, and an Azure OpenAI chat model generates the answer.
We define a prompt containing our question, embed it with the same voyage-context-3 model, retrieve the top matching documents (searching the text
field and returning a fixed number of results), and pass them to the model as context for generation.
import os
from openai import AzureOpenAI
from rich.console import Console
from rich.panel import Panel
# Create MongoDB vector search query for "Attention is All You Need"
# Define the question we want answered using only the retrieved context
prompt = "Give me top 3 learning points from `Attention is All You Need`, using only the retrieved context."
# Generate embedding for the query using VoyageAI (vo already initialized earlier)
query_embd_context = (
vo.contextualized_embed(
inputs=[[prompt]], model="voyage-context-3", input_type="query"
)
.results[0]
.embeddings[0]
)
# Vector search pipeline
search_pipeline = [
{
"$vectorSearch": {
"index": "vector_index",
"path": "embedding",
"queryVector": query_embd_context,
"numCandidates": 10,
"limit": 10,
}
},
{"$project": {"text": 1, "_id": 0, "score": {"$meta": "vectorSearchScore"}}},
]
results = list(collection.aggregate(search_pipeline))
if not results:
raise ValueError(
"No vector search results returned. Verify the index is built before querying."
)
context_texts = [doc["text"] for doc in results]
combined_context = "\n\n".join(context_texts)
# Read Azure OpenAI settings from environment variables (do NOT hardcode secrets):
#   AZURE_OPENAI_API_KEY
#   AZURE_OPENAI_ENDPOINT     -> e.g. https://your-resource-name.openai.azure.com/
#   AZURE_OPENAI_API_VERSION  -> optional, falls back to a stable default
AZURE_OPENAI_API_KEY = os.environ["AZURE_OPENAI_API_KEY"]
AZURE_OPENAI_ENDPOINT = os.environ["AZURE_OPENAI_ENDPOINT"]
AZURE_OPENAI_API_VERSION = os.environ.get("AZURE_OPENAI_API_VERSION", "2024-02-01")
# Initialize the Azure OpenAI client (the endpoint must NOT include path segments)
openai_client = AzureOpenAI(
    api_key=AZURE_OPENAI_API_KEY,
    azure_endpoint=AZURE_OPENAI_ENDPOINT.rstrip("/"),
    api_version=AZURE_OPENAI_API_VERSION,
)
# Chat completion using retrieved context
response = openai_client.chat.completions.create(
model="gpt-4o-mini", # Azure deployment name
messages=[
{
"role": "system",
"content": "You are a helpful assistant. Use only the provided context to answer questions. If the context is insufficient, say so.",
},
{
"role": "user",
"content": f"Context:\n{combined_context}\n\nQuestion: {prompt}",
},
],
temperature=0.2,
)
response_text = response.choices[0].message.content
console = Console()
console.print(Panel(f"{prompt}", title="Prompt", border_style="bold red"))
console.print(
Panel(response_text, title="Generated Content", border_style="bold green")
)
╭──────────────────────────────────────────────────── Prompt ─────────────────────────────────────────────────────╮ │ Give me top 3 learning points from `Attention is All You Need`, using only the retrieved context. │ ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─────────────────────────────────────────────── Generated Content ───────────────────────────────────────────────╮ │ 1. **Introduction of the Transformer Architecture**: The Transformer model is a novel architecture that relies │ │ entirely on attention mechanisms, eliminating the need for recurrence and convolutions. This allows for │ │ significantly more parallelization during training and leads to superior performance in tasks such as machine │ │ translation. │ │ │ │ 2. **Performance and Efficiency**: The Transformer achieves state-of-the-art results on machine translation │ │ tasks, such as a BLEU score of 28.4 on the WMT 2014 English-to-German task and 41.8 on the English-to-French │ │ task, while requiring much less training time (3.5 days on eight GPUs) compared to previous models. This │ │ demonstrates the efficiency and effectiveness of the architecture. │ │ │ │ 3. **Self-Attention Mechanism**: The self-attention layers in both the encoder and decoder allow for each │ │ position to attend to all other positions in the sequence, enabling the model to capture global dependencies. │ │ This mechanism is more computationally efficient than recurrent layers, which require sequential operations, │ │ thus improving the model's speed and scalability. │ ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
This notebook demonstrated a powerful RAG pipeline using MongoDB, VoyageAI, and Azure OpenAI. By combining MongoDB's vector search capabilities with VoyageAI's embeddings and Azure OpenAI's language models, we created an intelligent document retrieval system.
Key Achievements:¶
- Processed documents with Docling
- Generated contextual embeddings with VoyageAI
- Stored vectors in MongoDB Atlas
- Implemented semantic search for relevant context retrieval
- Generated accurate responses with Azure OpenAI
Next Steps:¶
- Expand your knowledge base with more documents
- Experiment with chunking and embedding parameters
- Build a user interface
- Implement evaluation metrics
- Deploy to production with proper scaling
Start building your own intelligent document retrieval system today!