EPUB Document Conversion

This example demonstrates how to convert EPUB (Electronic Publication) files using Docling's EPUB backend.

EPUB is a widely used open standard for e-books and digital publications. An EPUB file is a ZIP archive that packages XHTML content documents, images, and metadata in a defined structure.

What you'll learn

  • How to convert EPUB files to structured DoclingDocument format
  • How to extract and handle images from EPUB archives
  • How to access EPUB metadata (title, author, language, etc.)
  • How to export EPUB content to various formats (Markdown, JSON, etc.)
  • How the backend parses EPUB structure and which conversion features it offers

Setup

Install Docling:

%pip install -q docling

Download Sample EPUB File

For this example, we'll use a public domain EPUB file from Standard Ebooks, a volunteer-driven project that produces high-quality, carefully formatted public domain ebooks.

The book we'll use is "Poetry" by Sarah Louisa Forten Purvis, available at: https://standardebooks.org/ebooks/sarah-louisa-forten-purvis/poetry

Standard Ebooks productions are in the US public domain.

import urllib.request
from pathlib import Path

# Create directory for EPUB data
data_dir = Path("epub_data")
data_dir.mkdir(exist_ok=True)

# Download sample EPUB file from Standard Ebooks
# Note: We use the Docling test data mirror for reliable downloads in notebooks
# Original source: https://standardebooks.org/ebooks/sarah-louisa-forten-purvis/poetry
epub_file = data_dir / "sarah-louisa-forten-purvis_poetry.epub"
if not epub_file.exists():
    print("Downloading sample EPUB file...")
    print("Source: 'Poetry' by Sarah Louisa Forten Purvis from Standard Ebooks")
    # Using Docling test data for reliable notebook execution
    epub_url = "https://raw.githubusercontent.com/docling-project/docling/main/tests/data/epub/epub_purvis_poetry.epub"
    urllib.request.urlretrieve(epub_url, epub_file)
    print(f"Downloaded: {epub_file}")
    print(f"File size: {epub_file.stat().st_size / 1024:.1f} KB")
else:
    print(f"Using existing file: {epub_file}")
    print(f"File size: {epub_file.stat().st_size / 1024:.1f} KB")
Using existing file: epub_data/sarah-louisa-forten-purvis_poetry.epub
File size: 393.1 KB
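
Before converting, it can be worth confirming that a downloaded file really is an EPUB container and not, say, an HTML error page saved under an .epub name. The sketch below uses only the standard library and relies on the EPUB container rule that the archive holds a `mimetype` entry with the value `application/epub+zip`. The helper name `looks_like_epub` is an assumption made for this illustration; it is not part of Docling.

```python
import zipfile
from pathlib import Path


def looks_like_epub(path: Path) -> bool:
    """Return True if `path` is a ZIP archive whose `mimetype` entry
    declares the EPUB media type. (Hypothetical helper, not a Docling API.)"""
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        try:
            declared = archive.read("mimetype")
        except KeyError:
            # No `mimetype` entry, so this is not a well-formed EPUB container
            return False
    return declared.strip() == b"application/epub+zip"
```

In the notebook this could be called as `looks_like_epub(epub_file)` right after the download cell, raising or re-downloading if it returns False.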

Basic EPUB Conversion

Let's start with a simple conversion using the default settings:

from docling.document_converter import DocumentConverter

# Create converter instance
converter = DocumentConverter()

# Convert the EPUB file
print(f"Converting EPUB document: {epub_file}")
result = converter.convert(epub_file)
doc = result.document

print("\nConversion successful!")
print(f"Document name: {doc.name}")
print(f"Number of items: {len(list(doc.iterate_items()))}")
Converting EPUB document: epub_data/sarah-louisa-forten-purvis_poetry.epub

Conversion successful!
Document name: sarah-louisa-forten-purvis_poetry
Number of items: 168

Inspect Document Structure

Let's examine the structure of the converted document:

from docling_core.types.doc import DocItemLabel

# Count items by type
item_counts = {}
for item, _ in doc.iterate_items():
    label = item.label
    item_counts[label] = item_counts.get(label, 0) + 1

print("Document structure:")
for label, count in sorted(item_counts.items(), key=lambda x: x[1], reverse=True):
    print(f"  {label.value}: {count}")
Document structure:
  text: 144
  section_header: 18
  picture: 3
  caption: 2
  title: 1

View Sample Content

Let's look at some of the extracted content:

# Display first few text items
print("Sample text content:\n")
text_count = 0
for item, _ in doc.iterate_items():
    if item.label == DocItemLabel.TEXT and text_count < 5:
        print(f"- {item.text[:150]}..." if len(item.text) > 150 else f"- {item.text}")
        print()
        text_count += 1
Sample text content:

- By

- Sarah Louisa Forten Purvis

- .

- This ebook is the product of many hours of hard work by volunteers for

- Standard Ebooks

Export to Markdown (Basic)

Export the document to Markdown format without images:

# Export to Markdown without images
markdown_content = doc.export_to_markdown()

# Display first 1500 characters
print("Markdown export (first 1500 characters):\n")
print(markdown_content[:1500])
print("\n...")

# Save to file using save_as_markdown (faster than write_text)
output_md = data_dir / "output_basic.md"
doc.save_as_markdown(output_md)
print(f"\nFull markdown saved to: {output_md}")
Markdown export (first 1500 characters):

# Poetry

By **Sarah Louisa Forten Purvis** .

<!-- image -->

## Imprint

The Standard Ebooks logo.

<!-- image -->

This ebook is the product of many hours of hard work by volunteers for [Standard Ebooks](https://standardebooks.org/) , and builds on the hard work of other literature lovers made possible by the public domain.

This particular ebook is based on digital scans from the [Internet Archive](https://standardebooks.org/ebooks/sarah-louisa-forten-purvis/poetry#page-scans) .

The source text and artwork in this ebook are believed to be in the United States public domain; that is, they are believed to be free of copyright restrictions in the United States. They may still be copyrighted in other countries, so users located outside of the United States must check their local laws before using this ebook. The creators of, and contributors to, this ebook dedicate their contributions to the worldwide public domain via the terms in the [CC0 1.0 Universal Public Domain Dedication](https://creativecommons.org/publicdomain/zero/1.0/) . For full license information, see the [Uncopyright](uncopyright.xhtml) at the end of this ebook.

Standard Ebooks is a volunteer-driven project that produces ebook editions of public domain literature using modern typography, technology, and editorial standards, and distributes them free of cost. You can download this and other ebooks carefully produced for true book lovers at [standardebooks.org](https://standardebooks.org/) .

## The Grave of t

...

Full markdown saved to: epub_data/output_basic.md

EPUB Conversion with Image Extraction

Now let's configure the converter to extract images from the EPUB archive:

from docling.datamodel.backend_options import EpubBackendOptions
from docling.document_converter import DocumentConverter, EpubFormatOption

# Configure EPUB options to extract images
epub_options = EpubBackendOptions(
    fetch_images=True,  # Extract images from EPUB archive
    enable_local_fetch=True,  # Allow reading local image files
    enable_remote_fetch=False,  # Disable fetching remote images
)

# Create converter with EPUB options
converter_with_images = DocumentConverter(
    format_options={"epub": EpubFormatOption(backend_options=epub_options)}
)

# Convert the EPUB with image extraction
print("Converting EPUB with image extraction...")
result_with_images = converter_with_images.convert(epub_file)
doc_with_images = result_with_images.document

print("\nConversion with images successful!")
print(f"Number of items: {len(list(doc_with_images.iterate_items()))}")
Converting EPUB with image extraction...

Conversion with images successful!
Number of items: 168

Export with Embedded Images

Export the document with images embedded as base64 data URIs:

# Export with embedded images (base64-encoded)
markdown_with_images = doc_with_images.export_to_markdown(image_mode="embedded")

# Display first 1500 characters
print("Markdown with embedded images (first 1500 characters):\n")
print(markdown_with_images[:1500])
print("\n...")

# Save to file using save_as_markdown
output_md_images = data_dir / "output_with_images.md"
doc_with_images.save_as_markdown(output_md_images, image_mode="embedded")
print(f"\nMarkdown with embedded images saved to: {output_md_images}")
Markdown with embedded images (first 1500 characters):

# Poetry

By **Sarah Louisa Forten Purvis** .

![Image](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAABXgAAAGkCAQAAAAhsYX7AAB0QElEQVR4nO3dCZzM5R8H8M+zdt33fYSQmw6kFIVUypW7coZcHbqTDnSQdPlXihxROkhylEoUpaQI5YhUKPd9W3a//9d3ZrHH/H7z/GZml939vOf1mq2d5/jNb2bNM8/veb5fgIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIiIgyLXOuD4CIyIZkRyVURmHkRV7kRjyO4BAOYBf+wAZzgueQiIicccBLlAlIVeQMo/pBxOGg2Y1zQAyq4To0wiUoiyiHQnHYhBX4BgvMmgj2XAglkZoOmX/O9GVwKdLeenMk6S+kMMqG1JJgP4AdydtzLH4Joj338auJ935gvt6yhHB2T5pVofVGROcnDniJMgFZjsvCbuQ4tmEjfsNv+N5sQKoTg/rohuYo5v//k9iItdiOgziIw4hCLuRBPpREFZRDzOlK2zAbk/GDkQj0fy9GITV9ba4/01d2HEPau8Z8l/QX0htjwmrxALZgDVbgZyw2Ls9IhmOg57YHmP+FdlDyEEZ6rvS0GRxab0R0fuKAlyiTDHjX4miItfMiCwqgQOJfbcaX+AALQ51zC0aKoy+6oLz+93+Yj2/wIzbilEPpGFyEemiMRqenZP/Eu3jL7AzzGO7FqD3YitSQBxemGPAKViDtVEKuZANend39tVd831BaM8iPKBRDjrO/Oobv8BE+NgcDlZdsWI5qR7HWsv0o/b52GBebv70fm1yEVcixHLbfgaojO7ASdU2s976IiIjoHJLlIpcJwrrlkIukmQyUqbJTEmyRQZI/4sdaWv4nR7X5TfKsVPV0jDVkmGzxH9sReUVKhXUc94qMCvOcOd2a6BHOS9RXdpFjqdRX4NsiPYIGSZ5vL5G3w2y1kFwm3eRl+UXi/K/CURknFQOe3cvlZJzUt275fW3tK/E8RSNGvhWZaN3P1XrksXIuFpgQERHRuR/wnr0ZuUSelj/9Q5qD8ozkjNhxFpTRckIkTqZLY4kK6eiipInMkHg9tuPyquQL+Vg44A3jPVJYussC/7D3lEyS4gHO7zCRdZLDsr1CskPb6uH5Vewrsk

...

Markdown with embedded images saved to: epub_data/output_with_images.md

Check for Images in Document

Let's check if the EPUB contains any images:

# Check for pictures in the document
from docling_core.types.doc import PictureItem

pictures = [
    item for item, _ in doc_with_images.iterate_items() if isinstance(item, PictureItem)
]

if pictures:
    print(f"Found {len(pictures)} image(s) in the EPUB:")
    for i, pic in enumerate(pictures[:5], 1):  # Show first 5
        print(f"  {i}. Image at position {pic.self_ref}")
        if hasattr(pic, "image") and pic.image:
            print(
                f"     Size: {pic.image.size if hasattr(pic.image, 'size') else 'unknown'}"
            )
else:
    print("No images found in this EPUB.")
    print("Note: This particular EPUB (poetry collection) may not contain images.")
Found 3 image(s) in the EPUB:
  1. Image at position #/pictures/0
     Size: width=1400.0 height=420.0
  2. Image at position #/pictures/1
     Size: width=220.0 height=140.0
  3. Image at position #/pictures/2
     Size: width=220.0 height=140.0

Export to JSON

Export the complete document structure to JSON:

import json

# Export to JSON
output_json = data_dir / "output.json"
doc_with_images.save_as_json(output_json)

print(f"Document exported to JSON: {output_json}")
print(f"File size: {output_json.stat().st_size / 1024:.2f} KB")

# Display a sample of the JSON structure
with open(output_json, encoding="utf-8") as f:
    json_data = json.load(f)
    print("\nJSON structure (top-level keys):")
    for key in json_data.keys():
        print(f"  - {key}")
Document exported to JSON: epub_data/output.json
File size: 162.79 KB

JSON structure (top-level keys):
  - schema_name
  - version
  - name
  - origin
  - furniture
  - body
  - groups
  - texts
  - pictures
  - tables
  - key_value_items
  - form_items
  - pages
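
To get a quick feel for what the exported JSON holds, one option is to count the entries under each list-valued top-level key. This is a minimal sketch that works on any dictionary loaded with `json.load`; the function name `summarize_collections` and the sample values are assumptions for illustration, and only the key names are taken from the listing above.

```python
def summarize_collections(json_data: dict) -> dict:
    """Map each list-valued top-level key to the number of entries it holds."""
    return {
        key: len(value)
        for key, value in json_data.items()
        if isinstance(value, list)
    }


# Stand-in data using a few of the top-level keys listed above
# (the values are hypothetical, not taken from the real export)
sample = {
    "schema_name": "DoclingDocument",
    "name": "example",
    "texts": [{"text": "By"}, {"text": "Sarah Louisa Forten Purvis"}],
    "pictures": [{}],
    "tables": [],
}
print(summarize_collections(sample))
```

With the notebook's own data, `summarize_collections(json_data)` would be called on the dictionary loaded in the cell above.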

Understanding EPUB Features

The EPUB backend provides several key features:

Structure Parsing

  • Parses EPUB structure: Reads the container.xml and content.opf files to understand the book's organization
  • Preserves reading order: Processes content files in the order specified by the spine element
  • Handles internal links: Automatically fixes cross-file references (e.g., footnote links) when combining XHTML files
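
The first two points can be sketched with the standard library alone. The example builds a tiny in-memory EPUB, follows container.xml to the OPF file, and resolves the spine to archive paths in reading order. It illustrates the general EPUB mechanism under simplifying assumptions; it is not Docling's implementation, and `spine_paths` and the sample file names are hypothetical.

```python
import io
import posixpath
import xml.etree.ElementTree as ET
import zipfile

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
}


def spine_paths(archive: zipfile.ZipFile) -> list[str]:
    """Return archive paths of the content documents in spine (reading) order."""
    container = ET.fromstring(archive.read("META-INF/container.xml"))
    opf_path = container.find("c:rootfiles/c:rootfile", NS).attrib["full-path"]
    package = ET.fromstring(archive.read(opf_path))
    # The manifest maps item ids to hrefs, relative to the OPF file's folder
    hrefs = {
        item.attrib["id"]: item.attrib["href"]
        for item in package.findall("opf:manifest/opf:item", NS)
    }
    base = posixpath.dirname(opf_path)
    return [
        posixpath.join(base, hrefs[ref.attrib["idref"]])
        for ref in package.findall("opf:spine/opf:itemref", NS)
    ]


# Hand-written minimal container and package documents (illustrative only)
CONTAINER = """<container version="1.0"
    xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="epub/content.opf"
              media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>"""

OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <manifest>
    <item id="ch2" href="text/chapter-2.xhtml" media-type="application/xhtml+xml"/>
    <item id="ch1" href="text/chapter-1.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine>
    <itemref idref="ch1"/>
    <itemref idref="ch2"/>
  </spine>
</package>"""

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("mimetype", "application/epub+zip")
    archive.writestr("META-INF/container.xml", CONTAINER)
    archive.writestr("epub/content.opf", OPF)

with zipfile.ZipFile(buffer) as archive:
    print(spine_paths(archive))
```

Note that the manifest lists chapter 2 first, while the result follows the spine. That is why the spine, not the manifest or the ZIP listing, determines reading order.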

Metadata Extraction

  • Retrieves title, author, language, and other Dublin Core metadata from the OPF file
  • Metadata is accessible through the DoclingDocument structure
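
For reference, Dublin Core fields sit in the `<metadata>` element of the OPF file under the `http://purl.org/dc/elements/1.1/` namespace. The sketch below reads them directly with the standard library. It does not show how Docling exposes metadata on a DoclingDocument; the function name `dublin_core` and the sample OPF (including its language value) are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def dublin_core(opf_xml: str) -> dict[str, list[str]]:
    """Collect Dublin Core fields from an OPF package document.

    Values are lists because fields such as creator may repeat.
    """
    package = ET.fromstring(opf_xml)
    metadata = package.find("opf:metadata", OPF_NS)
    fields: dict[str, list[str]] = {}
    if metadata is None:
        return fields
    for element in metadata:
        # Keep only elements in the Dublin Core namespace that carry text
        if element.tag.startswith(DC) and element.text:
            name = element.tag[len(DC):]
            fields.setdefault(name, []).append(element.text.strip())
    return fields


# Hand-written sample package document (illustrative values)
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf"
         xmlns:dc="http://purl.org/dc/elements/1.1/" version="3.0">
  <metadata>
    <dc:title>Poetry</dc:title>
    <dc:creator>Sarah Louisa Forten Purvis</dc:creator>
    <dc:language>en</dc:language>
    <meta name="cover" content="cover-image"/>
  </metadata>
</package>"""

print(dublin_core(SAMPLE_OPF))
```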

Image Handling

  • Can extract and embed images from the EPUB archive when fetch_images=True
  • Supports multiple export modes:
      • image_mode='placeholder' (default): Replaces images with <!-- image --> comments
      • image_mode='embedded': Embeds images as base64 data URIs in the Markdown
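
To make the embedded mode less opaque, here is a small sketch of how a base64 data URI is formed from raw image bytes and wrapped in Markdown image syntax, matching the `![Image](data:image/png;base64,...)` shape seen in the export earlier. The helper names are assumptions for this illustration and not Docling functions.

```python
import base64


def to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw bytes as a base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def markdown_image(image_bytes: bytes, alt: str = "Image",
                   mime_type: str = "image/png") -> str:
    """Wrap a data URI in Markdown image syntax."""
    return f"![{alt}]({to_data_uri(image_bytes, mime_type)})"


# The 8-byte PNG file signature stands in for a real image here
png_signature = b"\x89PNG\r\n\x1a\n"
print(markdown_image(png_signature))
```

Because base64 output is roughly a third larger than the bytes it encodes, embedding makes the Markdown file self-contained at the cost of size, which is why the placeholder mode is the lighter default.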

HTML Backend Integration

  • Leverages the existing HTML backend for robust XHTML content processing
  • Ensures consistent handling of HTML elements across different document types

Batch Conversion Example

Using Python API

Here's how you would convert multiple EPUB files in a directory using Python:

from pathlib import Path
from docling.document_converter import DocumentConverter

converter = DocumentConverter()

# Convert all EPUB files in a directory
epub_dir = Path("path/to/epub/directory")
for epub_file in epub_dir.glob("*.epub"):
    print(f"Converting {epub_file.name}...")
    result = converter.convert(str(epub_file))

    # Save to markdown with embedded images
    output_path = epub_file.with_suffix(".md")
    result.document.save_as_markdown(output_path, image_mode="embedded")
    print(f"Saved to {output_path}")

Using CLI

Alternatively, you can use the Docling CLI for batch conversion, which is even simpler:

docling --to md --from epub path/to/epub/directory

Known Limitations

Internal anchor links (such as footnote references) are partially supported:

  • Links are converted: References like [1](#note-1) will appear in the output
  • Anchor targets are not preserved: The corresponding anchor IDs (e.g., id="note-1") are lost during HTML-to-DoclingDocument conversion
  • Impact: Clicking on footnote links in the exported Markdown won't jump to the footnote location

This is a limitation of the underlying HTML backend's conversion process, which focuses on extracting content structure rather than preserving HTML anchor IDs.

Example:

<!-- In the text -->
...five versts [1](#note-1) from Durnovka...

<!-- At the end (footnote section) -->
1. A verst is two-thirds of a mile. [↩︎](#noteref-1)

The links [1](#note-1) and [↩︎](#noteref-1) will be present, but the anchor targets they reference won't be accessible in the Markdown output.
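
If dead footnote links are a problem for downstream use, one possible workaround is to post-process the exported Markdown. The sketch below finds links whose target is a same-document fragment and can optionally reduce them to a bracketed label. It is a simple regex-based approach, not a Docling feature; the function names are hypothetical, and it does not handle labels containing nested brackets.

```python
import re

# Matches Markdown links that point at a same-document fragment, e.g. [1](#note-1)
ANCHOR_LINK = re.compile(r"\[([^\]]*)\]\(#([^)\s]+)\)")


def internal_anchor_links(markdown: str) -> list[tuple[str, str]]:
    """Return (label, fragment) pairs for every same-document link."""
    return ANCHOR_LINK.findall(markdown)


def unlink_internal_anchors(markdown: str) -> str:
    """Replace [label](#fragment) with [label]; other links stay untouched."""
    return ANCHOR_LINK.sub(lambda match: f"[{match.group(1)}]", markdown)


sample = "...five versts [1](#note-1) from Durnovka..."
print(internal_anchor_links(sample))
print(unlink_internal_anchors(sample))
```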

Technical Details

EPUB files are ZIP archives containing:

  • XHTML content files
  • Metadata (OPF file)
  • Navigation structure
  • Images and other resources

The backend processing workflow:

  1. Extracts the ZIP archive
  2. Parses the container.xml to locate the OPF file
  3. Reads the OPF file to get metadata and reading order
  4. Combines all XHTML content files in spine order
  5. Fixes internal cross-file links
  6. Delegates to the HTML backend for final processing
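
Step 5 is the least obvious part of the workflow. Once several XHTML files are merged into one document, a link such as `endnotes.xhtml#note-1` no longer has a separate file to point to. A plausible fix is to rewrite it to `#note-1`. The sketch below shows that idea with a regular expression. It is a guess at the general technique and not the backend's actual code; `localize_links` and the file names are hypothetical.

```python
import re

# Matches relative links into another XHTML file, e.g. href="endnotes.xhtml#note-1".
# Excluding ":" from the path skips absolute URLs such as https://...
CROSS_FILE_HREF = re.compile(r'href="([^"#:]+\.xhtml)#([^"]+)"')


def localize_links(xhtml: str, combined_files: set[str]) -> str:
    """Rewrite file.xhtml#frag to #frag when file.xhtml is part of the merge."""

    def rewrite(match: re.Match) -> str:
        target, fragment = match.group(1), match.group(2)
        # Compare by file name so that relative prefixes like ../text/ still match
        if target.rsplit("/", 1)[-1] in combined_files:
            return f'href="#{fragment}"'
        return match.group(0)

    return CROSS_FILE_HREF.sub(rewrite, xhtml)


html = '<p>five versts <a href="endnotes.xhtml#note-1">1</a> from Durnovka</p>'
print(localize_links(html, {"endnotes.xhtml"}))
```

Matching by file name alone assumes that names are unique within the book; a more careful version would resolve each href against the path of the file that contains it.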

Supported EPUB Versions

The backend supports EPUB 2 and EPUB 3 formats, which are the most common versions used for e-books.
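
To find out which version a given book declares, the `version` attribute on the OPF `<package>` element can be read directly. This is a minimal standard-library sketch; the function name `epub_major_version` is an assumption for illustration and not part of Docling.

```python
import xml.etree.ElementTree as ET


def epub_major_version(opf_xml: str) -> int:
    """Return the major EPUB version declared on the OPF <package> element."""
    version = ET.fromstring(opf_xml).attrib["version"]
    return int(version.split(".")[0])


# Hand-written one-element package documents (illustrative only)
print(epub_major_version('<package xmlns="http://www.idpf.org/2007/opf" version="3.0"/>'))
print(epub_major_version('<package xmlns="http://www.idpf.org/2007/opf" version="2.0"/>'))
```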

Summary

In this example, we demonstrated:

✅ How to convert EPUB files to DoclingDocument format
✅ How to extract and handle images from EPUB archives
✅ How to export EPUB content to Markdown and JSON formats
✅ Different image export modes (placeholder and embedded)
✅ Understanding EPUB structure and conversion features

Key Points

  • Simple conversion: Basic EPUB conversion works out of the box with DocumentConverter()
  • Image extraction: Enable with fetch_images=True in EpubBackendOptions
  • Flexible export: Choose between embedded images or placeholders
  • Metadata preservation: EPUB metadata is extracted and accessible in the document
  • Reading order: Content is processed in the correct reading order as specified in the EPUB

Next Steps

  • Try converting your own EPUB files
  • Experiment with different image export modes
  • Combine EPUB conversion with other Docling features like chunking for RAG applications
  • Explore the DoclingDocument API for more advanced document manipulation