EPUB Document Conversion
This example demonstrates how to convert EPUB (Electronic Publication) files using Docling's EPUB backend.
EPUB is a widely used open standard format for e-books and digital publications. It is based on XHTML and packages text, images, and metadata in a structured ZIP archive.
What you'll learn
- How to convert EPUB files to structured DoclingDocument format
- How to extract and handle images from EPUB archives
- How to access EPUB metadata (title, author, language, etc.)
- How to export EPUB content to various formats (Markdown, JSON, etc.)
- Understanding EPUB structure and conversion features
Setup
Install Docling:
%pip install -q docling
Download Sample EPUB File
For this example, we'll use a public domain EPUB file from Standard Ebooks, a volunteer-driven project that produces high-quality, carefully formatted public domain ebooks.
The book we'll use is "Poetry" by Sarah Louisa Forten Purvis, available at: https://standardebooks.org/ebooks/sarah-louisa-forten-purvis/poetry
Standard Ebooks productions are in the US public domain.
import urllib.request
from pathlib import Path
# Create directory for EPUB data
data_dir = Path("epub_data")
data_dir.mkdir(exist_ok=True)
# Download sample EPUB file from Standard Ebooks
# Note: We use the Docling test data mirror for reliable downloads in notebooks
# Original source: https://standardebooks.org/ebooks/sarah-louisa-forten-purvis/poetry
epub_file = data_dir / "sarah-louisa-forten-purvis_poetry.epub"
if not epub_file.exists():
    print("Downloading sample EPUB file...")
    print("Source: 'Poetry' by Sarah Louisa Forten Purvis from Standard Ebooks")
    # Using Docling test data for reliable notebook execution
    epub_url = "https://raw.githubusercontent.com/docling-project/docling/main/tests/data/epub/epub_purvis_poetry.epub"
    urllib.request.urlretrieve(epub_url, epub_file)
    print(f"Downloaded: {epub_file}")
    print(f"File size: {epub_file.stat().st_size / 1024:.1f} KB")
else:
    print(f"Using existing file: {epub_file}")
    print(f"File size: {epub_file.stat().st_size / 1024:.1f} KB")
Using existing file: epub_data/sarah-louisa-forten-purvis_poetry.epub
File size: 393.1 KB
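Before converting, it can help to confirm that a downloaded file really is an EPUB container. The sketch below uses only the standard library and is not part of Docling. It relies on the EPUB packaging convention that the archive carries a `mimetype` entry containing `application/epub+zip`. The helper name `looks_like_epub` and the in-memory stand-in archive are assumptions made for illustration.

```python
import io
import zipfile

EPUB_MIMETYPE = b"application/epub+zip"


def looks_like_epub(source) -> bool:
    """Return True if `source` (a path or file-like object) is a ZIP
    archive whose `mimetype` entry declares an EPUB."""
    if not zipfile.is_zipfile(source):
        return False
    with zipfile.ZipFile(source) as archive:
        try:
            return archive.read("mimetype").strip() == EPUB_MIMETYPE
        except KeyError:  # the archive has no "mimetype" entry
            return False


# Build a tiny stand-in archive in memory so the sketch runs without a real book.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("mimetype", "application/epub+zip")
    archive.writestr("META-INF/container.xml", "<container/>")

print(looks_like_epub(buffer))
```

With the file downloaded above, the same check would be `looks_like_epub(epub_file)`.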
Basic EPUB Conversion
Let's start with a simple conversion using the default settings:
from docling.document_converter import DocumentConverter
# Create converter instance
converter = DocumentConverter()
# Convert the EPUB file
print(f"Converting EPUB document: {epub_file}")
result = converter.convert(epub_file)
doc = result.document
print("\nConversion successful!")
print(f"Document name: {doc.name}")
print(f"Number of items: {len(list(doc.iterate_items()))}")
Converting EPUB document: epub_data/sarah-louisa-forten-purvis_poetry.epub
Conversion successful!
Document name: sarah-louisa-forten-purvis_poetry
Number of items: 168
Inspect Document Structure
Let's examine the structure of the converted document:
from docling_core.types.doc import DocItemLabel
# Count items by type
item_counts = {}
for item, _ in doc.iterate_items():
    label = item.label
    item_counts[label] = item_counts.get(label, 0) + 1
print("Document structure:")
for label, count in sorted(item_counts.items(), key=lambda x: x[1], reverse=True):
    print(f" {label.value}: {count}")
Document structure:
text: 144
section_header: 18
picture: 3
caption: 2
title: 1
View Sample Content
Let's look at some of the extracted content:
# Display first few text items
print("Sample text content:\n")
text_count = 0
for item, _ in doc.iterate_items():
    if item.label == DocItemLabel.TEXT and text_count < 5:
        print(f"- {item.text[:150]}..." if len(item.text) > 150 else f"- {item.text}")
        print()
        text_count += 1
Sample text content:
- By
- Sarah Louisa Forten Purvis
- .
- This ebook is the product of many hours of hard work by volunteers for
- Standard Ebooks
Export to Markdown (Basic)
Export the document to Markdown format without images:
# Export to Markdown without images
markdown_content = doc.export_to_markdown()
# Display first 1500 characters
print("Markdown export (first 1500 characters):\n")
print(markdown_content[:1500])
print("\n...")
# Save the full export to a file using save_as_markdown
output_md = data_dir / "output_basic.md"
doc.save_as_markdown(output_md)
print(f"\nFull markdown saved to: {output_md}")
Markdown export (first 1500 characters):
# Poetry
By **Sarah Louisa Forten Purvis** .
<!-- image -->
## Imprint
The Standard Ebooks logo.
<!-- image -->
This ebook is the product of many hours of hard work by volunteers for [Standard Ebooks](https://standardebooks.org/) , and builds on the hard work of other literature lovers made possible by the public domain.
This particular ebook is based on digital scans from the [Internet Archive](https://standardebooks.org/ebooks/sarah-louisa-forten-purvis/poetry#page-scans) .
The source text and artwork in this ebook are believed to be in the United States public domain; that is, they are believed to be free of copyright restrictions in the United States. They may still be copyrighted in other countries, so users located outside of the United States must check their local laws before using this ebook. The creators of, and contributors to, this ebook dedicate their contributions to the worldwide public domain via the terms in the [CC0 1.0 Universal Public Domain Dedication](https://creativecommons.org/publicdomain/zero/1.0/) . For full license information, see the [Uncopyright](uncopyright.xhtml) at the end of this ebook.
Standard Ebooks is a volunteer-driven project that produces ebook editions of public domain literature using modern typography, technology, and editorial standards, and distributes them free of cost. You can download this and other ebooks carefully produced for true book lovers at [standardebooks.org](https://standardebooks.org/) .
## The Grave of t
...
Full markdown saved to: epub_data/output_basic.md
EPUB Conversion with Image Extraction
Now let's configure the converter to extract images from the EPUB archive:
from docling.datamodel.backend_options import EpubBackendOptions
from docling.document_converter import DocumentConverter, EpubFormatOption
# Configure EPUB options to extract images
epub_options = EpubBackendOptions(
fetch_images=True, # Extract images from EPUB archive
enable_local_fetch=True, # Allow reading local image files
enable_remote_fetch=False, # Disable fetching remote images
)
# Create converter with EPUB options
converter_with_images = DocumentConverter(
format_options={"epub": EpubFormatOption(backend_options=epub_options)}
)
# Convert the EPUB with image extraction
print("Converting EPUB with image extraction...")
result_with_images = converter_with_images.convert(epub_file)
doc_with_images = result_with_images.document
print("\nConversion with images successful!")
print(f"Number of items: {len(list(doc_with_images.iterate_items()))}")
Converting EPUB with image extraction...
Conversion with images successful!
Number of items: 168
Export with Embedded Images
Export the document with images embedded as base64 data URIs:
# Export with embedded images (base64-encoded)
markdown_with_images = doc_with_images.export_to_markdown(image_mode="embedded")
# Display first 1500 characters
print("Markdown with embedded images (first 1500 characters):\n")
print(markdown_with_images[:1500])
print("\n...")
# Save to file using save_as_markdown
output_md_images = data_dir / "output_with_images.md"
doc_with_images.save_as_markdown(output_md_images, image_mode="embedded")
print(f"\nMarkdown with embedded images saved to: {output_md_images}")
Markdown with embedded images (first 1500 characters):
# Poetry
By **Sarah Louisa Forten Purvis** .
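...
Markdown with embedded images saved to: epub_data/output_with_images.md
from docling_core.types.doc import PictureItem
# Collect the picture items from the converted document
pictures = [
    item for item, _ in doc_with_images.iterate_items() if isinstance(item, PictureItem)
]
if pictures:
    print(f"Found {len(pictures)} image(s) in the EPUB:")
    for i, pic in enumerate(pictures[:5], 1):  # Show first 5
        print(f" {i}. Image at position {pic.self_ref}")
        if hasattr(pic, "image") and pic.image:
            print(
                f" Size: {pic.image.size if hasattr(pic.image, 'size') else 'unknown'}"
            )
else:
    print("No images found in this EPUB.")
    print("Note: This particular EPUB (poetry collection) may not contain images.")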
Found 3 image(s) in the EPUB:
1. Image at position #/pictures/0
Size: width=1400.0 height=420.0
2. Image at position #/pictures/1
Size: width=220.0 height=140.0
3. Image at position #/pictures/2
Size: width=220.0 height=140.0
Export to JSON
Export the complete document structure to JSON:
import json
# Export to JSON
output_json = data_dir / "output.json"
doc_with_images.save_as_json(output_json)
print(f"Document exported to JSON: {output_json}")
print(f"File size: {output_json.stat().st_size / 1024:.2f} KB")
# Display a sample of the JSON structure
with open(output_json) as f:
    json_data = json.load(f)
print("\nJSON structure (top-level keys):")
for key in json_data.keys():
    print(f" - {key}")
Document exported to JSON: epub_data/output.json
File size: 162.79 KB
JSON structure (top-level keys):
- schema_name
- version
- name
- origin
- furniture
- body
- groups
- texts
- pictures
- tables
- key_value_items
- form_items
- pages
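Once the JSON is on disk, a quick way to get a feel for it is to count the entries in each list-valued top-level key. The following is a minimal sketch using only the standard library. The stand-in document and the helper name `collection_sizes` are assumptions for illustration; the key names mirror those printed above, but the values are invented.

```python
import json

# Stand-in for the exported file. Only the key names come from the output
# above; the values are invented for illustration.
sample_json = json.dumps(
    {
        "schema_name": "DoclingDocument",
        "name": "example",
        "texts": [{"text": "By"}, {"text": "Sarah Louisa Forten Purvis"}],
        "pictures": [{}],
        "tables": [],
    }
)


def collection_sizes(raw: str) -> dict[str, int]:
    """Map each list-valued top-level key to the number of entries it holds."""
    data = json.loads(raw)
    return {key: len(value) for key, value in data.items() if isinstance(value, list)}


print(collection_sizes(sample_json))
```

For the real export, the input would be `output_json.read_text()` instead of the stand-in string.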
Understanding EPUB Features
The EPUB backend provides several key features:
Structure Parsing
- Parses EPUB structure: Reads the container.xml and content.opf files to understand the book's organization
- Preserves reading order: Processes content files in the order specified by the spine element
- Handles internal links: Automatically fixes cross-file references (e.g., footnote links) when combining XHTML files
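To picture what fixing cross-file references involves: once several XHTML files are merged into one document, a link such as `endnotes.xhtml#note-1` has to become `#note-1`. The sketch below illustrates that idea with a regular expression. It is an assumption about the general technique, not the backend's actual implementation, and the function name `flatten_cross_file_links` is hypothetical.

```python
import re

# Matches href="some-file.xhtml#fragment" for relative links to other
# content files; absolute http(s) URLs are left alone.
CROSS_FILE_HREF = re.compile(r'href="(?!https?://)[^"#]+\.xhtml#([^"]+)"')


def flatten_cross_file_links(xhtml: str) -> str:
    """Rewrite links to anchors in other content files as same-document links."""
    return CROSS_FILE_HREF.sub(r'href="#\1"', xhtml)


snippet = (
    '<a href="endnotes.xhtml#note-1">1</a> '
    '<a href="https://standardebooks.org/">site</a>'
)
print(flatten_cross_file_links(snippet))
```

A production implementation would more likely walk the parsed document tree than use a regular expression, since attribute quoting and ordering can vary.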
Metadata Extraction
- Retrieves title, author, language, and other Dublin Core metadata from the OPF file
- Metadata is accessible through the DoclingDocument structure
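For reference, the Dublin Core fields live in the `<metadata>` element of the OPF package document. The sketch below reads them directly with the standard library, which can be handy for checking what a given book declares. It does not use Docling's API; the sample OPF and the helper name `read_dublin_core` are invented for illustration.

```python
import xml.etree.ElementTree as ET

NAMESPACES = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented minimal OPF package document, used only to make the sketch runnable.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Poetry</dc:title>
    <dc:creator>Sarah Louisa Forten Purvis</dc:creator>
    <dc:language>en-US</dc:language>
  </metadata>
</package>"""


def read_dublin_core(opf_xml: str) -> dict[str, list[str]]:
    """Collect Dublin Core fields (title, creator, language, ...) from an OPF string."""
    root = ET.fromstring(opf_xml)
    metadata = root.find("opf:metadata", NAMESPACES)
    fields: dict[str, list[str]] = {}
    if metadata is None:
        return fields
    dc_prefix = "{" + NAMESPACES["dc"] + "}"
    for element in metadata:
        # Keep only elements in the Dublin Core namespace that carry text.
        if element.tag.startswith(dc_prefix) and element.text:
            name = element.tag[len(dc_prefix):]
            fields.setdefault(name, []).append(element.text.strip())
    return fields


print(read_dublin_core(SAMPLE_OPF))
```

Values are kept as lists because a field such as `dc:creator` may appear more than once.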
Image Handling
- Can extract and embed images from the EPUB archive when fetch_images=True
- Supports multiple export modes:
  - image_mode='placeholder' (default): Replaces images with <!-- image --> comments
  - image_mode='embedded': Embeds images as base64 data URIs in the Markdown
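To make the embedded mode concrete: a base64 data URI is the image's bytes, base64-encoded, with a `data:<mime-type>;base64,` prefix. The sketch below builds one with the standard library. The helper names `to_data_uri` and `markdown_image` are hypothetical, and this is not how Docling itself generates the output.

```python
import base64


def to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def markdown_image(image_bytes: bytes, alt_text: str = "Image") -> str:
    """Wrap the data URI in Markdown image syntax."""
    return f"![{alt_text}]({to_data_uri(image_bytes)})"


# The 8-byte PNG signature stands in for real image data.
png_signature = b"\x89PNG\r\n\x1a\n"
print(markdown_image(png_signature))
```

Because base64 inflates data by roughly a third, embedded exports of image-heavy books can be much larger than placeholder exports.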
HTML Backend Integration
- Leverages the existing HTML backend for robust XHTML content processing
- Ensures consistent handling of HTML elements across different document types
Batch Conversion Example
Using Python API
Here's how you would convert multiple EPUB files in a directory using Python:
from pathlib import Path
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
# Convert all EPUB files in a directory
epub_dir = Path("path/to/epub/directory")
for epub_file in epub_dir.glob("*.epub"):
    print(f"Converting {epub_file.name}...")
    result = converter.convert(str(epub_file))
    # Save to markdown with embedded images
    output_path = epub_file.with_suffix(".md")
    result.document.save_as_markdown(output_path, image_mode="embedded")
    print(f"Saved to {output_path}")
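In a larger batch, one malformed file should not stop the whole run. The sketch below shows one possible way to structure an error-tolerant loop. The conversion step is passed in as a callable so the example runs without Docling installed; `convert_directory` and `fake_convert` are hypothetical names, and `fake_convert` is only a stand-in for the real conversion call.

```python
import tempfile
from pathlib import Path
from typing import Callable


def convert_directory(
    epub_dir: Path, convert_one: Callable[[Path, Path], None]
) -> tuple[list[Path], dict[Path, str]]:
    """Run `convert_one(source, target)` for every .epub file, collecting failures."""
    converted: list[Path] = []
    failed: dict[Path, str] = {}
    for source in sorted(epub_dir.glob("*.epub")):
        target = source.with_suffix(".md")
        try:
            convert_one(source, target)
        except Exception as exc:  # keep going so one bad file does not stop the batch
            failed[source] = str(exc)
        else:
            converted.append(target)
    return converted, failed


def fake_convert(source: Path, target: Path) -> None:
    """Hypothetical stand-in for the real conversion call."""
    if source.stat().st_size == 0:
        raise ValueError("empty file")
    target.write_text(f"# {source.stem}\n")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "good.epub").write_bytes(b"stub")
    (root / "empty.epub").write_bytes(b"")
    converted, failed = convert_directory(root, fake_convert)
    print([p.name for p in converted])
    print({p.name: message for p, message in failed.items()})
```

With Docling, `convert_one` could wrap the two calls from the loop above: `converter.convert(str(source))` followed by `result.document.save_as_markdown(target, image_mode="embedded")`.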
Using CLI
Alternatively, you can use the Docling CLI for batch conversion, which is even simpler:
docling --to md --from epub path/to/epub/directory
Known Limitations
Internal Anchor Links
Internal anchor links (such as footnote references) are partially supported:
- Links are converted: References like [1](#note-1) will appear in the output
- Anchor targets are not preserved: The corresponding anchor IDs (e.g., id="note-1") are lost during HTML-to-DoclingDocument conversion
- Impact: Clicking on footnote links in the exported Markdown won't jump to the footnote location
This is a limitation of the underlying HTML backend's conversion process, which focuses on extracting content structure rather than preserving HTML anchor IDs.
Example:
<!-- In the text -->
...five versts [1](#note-1) from Durnovka...
<!-- At the end (footnote section) -->
1. A verst is two-thirds of a mile. [↩︎](#noteref-1)
The links [1](#note-1) and [↩︎](#noteref-1) will be present, but the anchor targets they reference won't be accessible in the Markdown output.
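If dangling links are a problem for a downstream consumer, one option is to post-process the exported Markdown. The sketch below finds fragment-only links and, optionally, replaces them with their plain label. It is a simple regex-based workaround, not a Docling feature, and the function names are hypothetical.

```python
import re

# Matches Markdown links whose target is a same-document fragment: [label](#id)
FRAGMENT_LINK = re.compile(r"\[([^\]]*)\]\(#([^)\s]+)\)")


def fragment_links(markdown: str) -> list[tuple[str, str]]:
    """Return (label, fragment) pairs for every same-document link."""
    return FRAGMENT_LINK.findall(markdown)


def strip_fragment_links(markdown: str) -> str:
    """Replace same-document links with their label text only."""
    return FRAGMENT_LINK.sub(r"\1", markdown)


sample = (
    "...five versts [1](#note-1) from Durnovka...\n"
    "See [Standard Ebooks](https://standardebooks.org/).\n"
)
print(fragment_links(sample))
print(strip_fragment_links(sample))
```

External links are left untouched because the pattern only matches targets that start with `#`.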
Technical Details
EPUB files are ZIP archives containing:
- XHTML content files
- Metadata (OPF file)
- Navigation structure
- Images and other resources
The backend processing workflow:
1. Extracts the ZIP archive
2. Parses the container.xml to locate the OPF file
3. Reads the OPF file to get metadata and reading order
4. Combines all XHTML content files in spine order
5. Fixes internal cross-file links
6. Delegates to the HTML backend for final processing
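Steps 1 to 4 of this workflow can be sketched with the standard library alone. The example below opens an archive, follows container.xml to the OPF file, and resolves the spine into an ordered list of content files. It illustrates the EPUB layout rather than Docling's actual code; the function name `spine_documents` and the tiny in-memory archive are assumptions made for illustration.

```python
import io
import posixpath
import xml.etree.ElementTree as ET
import zipfile

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
}


def spine_documents(archive: zipfile.ZipFile) -> list[str]:
    """Return archive paths of the content documents in spine (reading) order."""
    # container.xml names the OPF package document.
    container = ET.fromstring(archive.read("META-INF/container.xml"))
    rootfile = container.find("c:rootfiles/c:rootfile", NS)
    if rootfile is None:
        raise ValueError("container.xml does not declare a rootfile")
    opf_path = rootfile.attrib["full-path"]

    # The manifest maps item ids to hrefs; the spine lists ids in reading order.
    package = ET.fromstring(archive.read(opf_path))
    hrefs = {
        item.attrib["id"]: item.attrib["href"]
        for item in package.findall("opf:manifest/opf:item", NS)
    }
    base = posixpath.dirname(opf_path)
    # Manifest hrefs are relative to the OPF file's own location.
    return [
        posixpath.normpath(posixpath.join(base, hrefs[ref.attrib["idref"]]))
        for ref in package.findall("opf:spine/opf:itemref", NS)
    ]


# Invented minimal archive; note the manifest order differs from the spine order.
CONTAINER_XML = """<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="epub/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>"""

CONTENT_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <manifest>
    <item id="poem-2" href="text/poem-2.xhtml" media-type="application/xhtml+xml"/>
    <item id="poem-1" href="text/poem-1.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine>
    <itemref idref="poem-1"/>
    <itemref idref="poem-2"/>
  </spine>
</package>"""

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("mimetype", "application/epub+zip")
    archive.writestr("META-INF/container.xml", CONTAINER_XML)
    archive.writestr("epub/content.opf", CONTENT_OPF)

with zipfile.ZipFile(buffer) as archive:
    print(spine_documents(archive))
```

The spine, not the manifest or the file names, determines reading order, which is why the sample manifest is deliberately listed out of order.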
Supported EPUB Versions
The backend supports EPUB 2 and EPUB 3 formats, which are the most common versions used for e-books.
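To check which version a particular book declares, one can read the `version` attribute on the OPF `<package>` element ("2.0" for EPUB 2, "3.0" for EPUB 3). The following is a minimal standard-library sketch; the helper name `epub_major_version` and the sample strings are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def epub_major_version(opf_xml: str) -> int:
    """Return the major EPUB version declared on the OPF <package> element."""
    root = ET.fromstring(opf_xml)
    version = root.attrib.get("version")
    if version is None:
        raise ValueError("OPF <package> element has no version attribute")
    return int(version.split(".")[0])


# Invented one-line package documents, used only to exercise the helper.
epub2_opf = '<package xmlns="http://www.idpf.org/2007/opf" version="2.0"/>'
epub3_opf = '<package xmlns="http://www.idpf.org/2007/opf" version="3.0"/>'

print(epub_major_version(epub2_opf), epub_major_version(epub3_opf))
```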
Summary
In this example, we demonstrated:
- How to convert EPUB files to DoclingDocument format
- How to extract and handle images from EPUB archives
- How to export EPUB content to Markdown and JSON formats
- Different image export modes (placeholder and embedded)
- Understanding EPUB structure and conversion features
Key Points
- Simple conversion: Basic EPUB conversion works out of the box with DocumentConverter()
- Image extraction: Enable with fetch_images=True in EpubBackendOptions
- Flexible export: Choose between embedded images or placeholders
- Metadata preservation: EPUB metadata is extracted and accessible in the document
- Reading order: Content is processed in the correct reading order as specified in the EPUB
Next Steps
- Try converting your own EPUB files
- Experiment with different image export modes
- Combine EPUB conversion with other Docling features like chunking for RAG applications
- Explore the DoclingDocument API for more advanced document manipulation