Processing audio and video

Docling's ASR (Automatic Speech Recognition) pipeline lets you convert audio and video files into a structured DoclingDocument — the same intermediate representation used for PDFs, DOCX files, and everything else. From there you can export to Markdown, JSON, HTML, or DocTags, and plug the result directly into RAG pipelines, summarizers, or search indexes.

Under the hood, Docling uses Whisper Turbo for transcription. On Apple Silicon it automatically selects mlx-whisper for optimized local inference; on all other hardware it falls back to native Whisper. You don't configure this — it just picks the right backend.

Supported formats

Type	Formats
Audio	WAV, MP3, M4A, AAC, OGG, FLAC
Video	MP4, AVI, MOV

For video files, Docling extracts the audio track automatically before transcription. You don't need to run FFmpeg manually.

ffmpeg required

Some audio formats (M4A, AAC, OGG, FLAC) and all video formats require ffmpeg to be installed and available on your PATH. Install it with your system package manager — e.g. brew install ffmpeg on macOS or apt-get install ffmpeg on Debian-based Linux.

Installation

The ASR pipeline is an optional extra. Install it alongside the base package:

pip install "docling[asr]"

Or with uv:

uv add "docling[asr]"

Basic usage

from pathlib import Path

from docling.datamodel import asr_model_specs
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import AsrPipelineOptions
from docling.document_converter import AudioFormatOption, DocumentConverter
from docling.pipeline.asr_pipeline import AsrPipeline

pipeline_options = AsrPipelineOptions()
pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO

converter = DocumentConverter(
    format_options={
        InputFormat.AUDIO: AudioFormatOption(
            pipeline_cls=AsrPipeline,
            pipeline_options=pipeline_options,
        )
    }
)

result = converter.convert(Path("recording.mp3"))
doc = result.document

# Export to Markdown
print(doc.export_to_markdown())

The same code works for video — pass an .mp4, .mov, or .avi path and Docling handles the rest.

Exporting to different formats

result.document is a DoclingDocument. You can export it to any supported format:

doc.export_to_markdown()   # Markdown
doc.export_to_dict()       # JSON-serializable dict
doc.export_to_html()       # HTML
doc.export_to_doctags()    # DocTags

See Serialization for more on export options.

Understanding the output

The ASR pipeline produces paragraph-level Markdown with timestamps per segment:

[time: 0.0-4.0]  Shakespeare on Scenery by Oscar Wilde

[time: 5.28-9.96]  This is a LibriVox recording. All LibriVox recordings are in the public domain.

This structured output is immediately suitable as input to a vector embedding model, a summarizer, or any other downstream stage.

A practical use case: searchable meeting archives

A common problem in engineering teams: every all-hands, customer call, and design review gets recorded. The recordings accumulate on Google Drive or S3. Nobody watches them. Nobody can search them. Institutional knowledge is locked inside audio files.

Docling solves the ingestion step. Pair it with a vector store and you have a queryable knowledge base over your entire audio archive.

Standalone transcription script

For a full working example, see the example-docling-media repository, which processes a directory of audio/video files and writes each transcript to a Markdown file.

The core of that project is ~30 lines:

from pathlib import Path

from docling.datamodel import asr_model_specs
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import AsrPipelineOptions
from docling.document_converter import AudioFormatOption, DocumentConverter
from docling.pipeline.asr_pipeline import AsrPipeline


def main():
    audio_path = Path("videoplayback.mp3")

    pipeline_options = AsrPipelineOptions()
    pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO

    converter = DocumentConverter(
        format_options={
            InputFormat.AUDIO: AudioFormatOption(
                pipeline_cls=AsrPipeline,
                pipeline_options=pipeline_options,
            )
        }
    )

    result = converter.convert(audio_path)
    md = result.document.export_to_markdown()
    Path("transcript.md").write_text(md)
    print(md)


if __name__ == "__main__":
    main()

Building a RAG pipeline with LangChain

Docling integrates with LangChain via DoclingLoader, which wraps DocumentConverter and handles chunking automatically. To build a retrieval pipeline over your audio archive:

from langchain_docling import DoclingLoader
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Load and chunk all audio files in a directory
loader = DoclingLoader("recordings/")
docs = loader.load()

# Embed and index
vectorstore = FAISS.from_documents(docs, OpenAIEmbeddings())
retriever = vectorstore.as_retriever()

# Query in natural language
results = retriever.invoke("What did we decide about the auth service in Q3?")

See the LangChain integration guide for more details on DoclingLoader options.

Customizing the ASR model

asr_model_specs.WHISPER_TURBO is the default and recommended starting point — it balances speed and accuracy for most use cases. To use a different model size, pass an alternative spec from docling.datamodel.asr_model_specs:

from docling.datamodel import asr_model_specs

pipeline_options.asr_options = asr_model_specs.WHISPER_LARGE_V3

Available specs depend on your installed version. Check dir(asr_model_specs) for the full list.

Limitations

Limitation	Workaround
No SRT/WebVTT subtitle output	Use `openai-whisper` CLI: `whisper audio.mp3 --output_format srt`
No speaker diarization	Use `pyannote-audio` as a pre- or post-processing step
No word-level timestamps	Not available in current export formats

For knowledge-retrieval use cases (RAG, search, summarization), paragraph-level Markdown is usually all you need. The limitations above matter primarily for subtitle generation workflows.