Document converter
This is an automatically generated API reference for the main components of Docling.
document_converter
Classes:
- DocumentConverter
- ConversionResult
- ConversionStatus
- FormatOption
- InputFormat – A document format supported by document backend parsers.
- PdfFormatOption
- ImageFormatOption
- StandardPdfPipeline
- WordFormatOption
- PowerpointFormatOption
- MarkdownFormatOption
- AsciiDocFormatOption
- HTMLFormatOption
- SimplePipeline – SimpleModelPipeline.
DocumentConverter
DocumentConverter(allowed_formats: Optional[list[InputFormat]] = None, format_options: Optional[dict[InputFormat, FormatOption]] = None)
Methods:
- convert
- convert_all
- convert_string
- initialize_pipeline – Initialize the conversion pipeline for the selected format.
Attributes:
- allowed_formats
- format_to_options (dict[InputFormat, FormatOption])
- initialized_pipelines (dict[tuple[Type[BasePipeline], str], BasePipeline])
allowed_formats
instance-attribute
allowed_formats = allowed_formats if allowed_formats is not None else list(InputFormat)
format_to_options
instance-attribute
format_to_options: dict[InputFormat, FormatOption] = {format: (_get_default_option(format=format) if (custom_option := (get(format))) is None else custom_option) for format in (allowed_formats)}
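The two attribute definitions above amount to a default-filling pattern: every allowed format gets the caller's option if one was supplied, and a default otherwise. The sketch below reproduces that logic with stand-in types so it runs without Docling installed; `Fmt`, `default_option`, and `build_options` are hypothetical names for illustration, not part of Docling.

```python
from enum import Enum
from typing import Optional


class Fmt(str, Enum):
    """Stand-in for InputFormat, trimmed to three members."""
    PDF = "pdf"
    DOCX = "docx"
    MD = "md"


def default_option(fmt: Fmt) -> str:
    """Stand-in for the library's default-option lookup."""
    return f"default-{fmt.value}"


def build_options(
    allowed_formats: Optional[list[Fmt]] = None,
    format_options: Optional[dict[Fmt, str]] = None,
) -> dict[Fmt, str]:
    # No allow-list means every format is allowed.
    allowed = allowed_formats if allowed_formats is not None else list(Fmt)
    custom = format_options or {}
    # Prefer a caller-supplied option; fall back to the default otherwise.
    return {
        fmt: (default_option(fmt) if (opt := custom.get(fmt)) is None else opt)
        for fmt in allowed
    }


options = build_options(format_options={Fmt.PDF: "custom-pdf"})
restricted = build_options(allowed_formats=[Fmt.MD])
```

In this sketch, an option supplied for a format outside the allow-list is simply ignored, because the comprehension iterates over the allowed formats only.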
initialized_pipelines
instance-attribute
initialized_pipelines: dict[tuple[Type[BasePipeline], str], BasePipeline] = {}
convert
convert(source: Union[Path, str, DocumentStream], headers: Optional[dict[str, str]] = None, raises_on_error: bool = True, max_num_pages: int = maxsize, max_file_size: int = maxsize, page_range: PageRange = DEFAULT_PAGE_RANGE) -> ConversionResult
convert_all
convert_all(source: Iterable[Union[Path, str, DocumentStream]], headers: Optional[dict[str, str]] = None, raises_on_error: bool = True, max_num_pages: int = maxsize, max_file_size: int = maxsize, page_range: PageRange = DEFAULT_PAGE_RANGE) -> Iterator[ConversionResult]
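Because `convert_all` returns an `Iterator[ConversionResult]`, results are presumably produced lazily as the iterator is consumed. The sketch below illustrates that pattern with stand-in functions, including one plausible way a single-source `convert` could be layered on top of the batch call. `fake_convert_all` and `fake_convert` are hypothetical and do not reflect Docling's actual implementation.

```python
from typing import Iterable, Iterator

processed: list[str] = []  # records which sources have actually been handled


def fake_convert_all(sources: Iterable[str]) -> Iterator[str]:
    """Generator stand-in: each source is handled only when requested."""
    for src in sources:
        processed.append(src)
        yield f"result:{src}"


def fake_convert(source: str) -> str:
    """Single-source helper built on the batch generator."""
    return next(fake_convert_all([source]))


results = fake_convert_all(["a.pdf", "b.docx"])
before = list(processed)       # nothing has run yet
first = next(results)          # handles only "a.pdf"
after_first = list(processed)
rest = list(results)           # drains the remaining sources
single = fake_convert("c.md")
```

With a lazy iterator, a caller that never consumes the result does no conversion work, so wrapping the call in `list(...)` or a `for` loop is what drives processing.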
convert_string
convert_string(content: str, format: InputFormat, name: Optional[str]) -> ConversionResult
initialize_pipeline
initialize_pipeline(format: InputFormat)
Initialize the conversion pipeline for the selected format.
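The type of `initialized_pipelines`, `dict[tuple[Type[BasePipeline], str], BasePipeline]`, suggests that pipelines are cached per pipeline class plus a string key. A minimal sketch of such a cache follows, assuming the string is a fingerprint of the pipeline options. That assumption, and the names `DemoPipeline`, `options_key`, and `get_pipeline`, are illustrative rather than taken from Docling.

```python
import hashlib
import json
from typing import Type


class DemoPipeline:
    """Stand-in pipeline that just remembers its options."""

    def __init__(self, options: dict):
        self.options = options


def options_key(options: dict) -> str:
    # Stable string fingerprint of the options (assumed keying scheme).
    payload = json.dumps(options, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


cache: dict[tuple[Type[DemoPipeline], str], DemoPipeline] = {}


def get_pipeline(cls: Type[DemoPipeline], options: dict) -> DemoPipeline:
    key = (cls, options_key(options))
    if key not in cache:
        cache[key] = cls(options)  # build once, reuse afterwards
    return cache[key]


p1 = get_pipeline(DemoPipeline, {"images_scale": 1.0})
p2 = get_pipeline(DemoPipeline, {"images_scale": 1.0})
p3 = get_pipeline(DemoPipeline, {"images_scale": 2.0})
```

Keying on both the class and the options means two formats that share a pipeline class and identical options can reuse one instance, while a change in options yields a separate pipeline.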
ConversionResult
Bases: BaseModel
Attributes:
- assembled (AssembledUnit)
- confidence (ConfidenceReport)
- document (DoclingDocument)
- errors (list[ErrorItem])
- input (InputDocument)
- legacy_document
- pages (list[Page])
- status (ConversionStatus)
- timings (dict[str, ProfilingItem])
assembled
class-attribute
instance-attribute
assembled: AssembledUnit = AssembledUnit()
confidence
class-attribute
instance-attribute
confidence: ConfidenceReport = Field(default_factory=ConfidenceReport)
errors
class-attribute
instance-attribute
errors: list[ErrorItem] = []
input
instance-attribute
input: InputDocument
legacy_document
property
legacy_document
pages
class-attribute
instance-attribute
pages: list[Page] = []
timings
class-attribute
instance-attribute
timings: dict[str, ProfilingItem] = {}
ConversionStatus
Bases: str, Enum
Attributes:
FAILURE
class-attribute
instance-attribute
FAILURE = 'failure'
PARTIAL_SUCCESS
class-attribute
instance-attribute
PARTIAL_SUCCESS = 'partial_success'
PENDING
class-attribute
instance-attribute
PENDING = 'pending'
SKIPPED
class-attribute
instance-attribute
SKIPPED = 'skipped'
STARTED
class-attribute
instance-attribute
STARTED = 'started'
SUCCESS
class-attribute
instance-attribute
SUCCESS = 'success'
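Since `ConversionStatus` derives from both `str` and `Enum`, its members compare equal to their string values and can be looked up by value. The sketch below redefines the enum locally, with the values listed above, so that it runs without Docling installed; the `tally` helper is hypothetical.

```python
from collections import Counter
from enum import Enum


class ConversionStatus(str, Enum):
    """Local replica of the enum documented above."""
    PENDING = "pending"
    STARTED = "started"
    FAILURE = "failure"
    SUCCESS = "success"
    PARTIAL_SUCCESS = "partial_success"
    SKIPPED = "skipped"


def tally(statuses: list[ConversionStatus]) -> Counter:
    """Count how many conversions ended in each status."""
    return Counter(s.value for s in statuses)


same_as_string = ConversionStatus.SUCCESS == "success"
by_value = ConversionStatus("partial_success")
counts = tally([
    ConversionStatus.SUCCESS,
    ConversionStatus.SUCCESS,
    ConversionStatus.FAILURE,
])
```

A tally like this is one way to summarise a batch run when errors are collected rather than raised.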
FormatOption
Bases: BaseFormatOption
Methods:
- set_optional_field_default
Attributes:
- backend (Type[AbstractDocumentBackend])
- backend_options (Optional[BackendOptions])
- model_config
- pipeline_cls (Type[BasePipeline])
- pipeline_options (Optional[PipelineOptions])
backend
instance-attribute
backend: Type[AbstractDocumentBackend]
backend_options
class-attribute
instance-attribute
backend_options: Optional[BackendOptions] = None
model_config
class-attribute
instance-attribute
model_config = ConfigDict(arbitrary_types_allowed=True)
pipeline_cls
instance-attribute
pipeline_cls: Type[BasePipeline]
pipeline_options
class-attribute
instance-attribute
pipeline_options: Optional[PipelineOptions] = None
set_optional_field_default
set_optional_field_default() -> Self
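The name `set_optional_field_default` and the `Optional[...] = None` fields suggest that unset options are filled in after construction. The following is a sketch of that idea using a plain dataclass rather than the pydantic-style model that `FormatOption` appears to be. The fallback to `get_default_options()` is an assumption, and `DemoPipeline` and `DemoFormatOption` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Type


class DemoPipeline:
    """Stand-in pipeline class exposing default options."""

    @classmethod
    def get_default_options(cls) -> dict:
        return {"images_scale": 1.0}


@dataclass
class DemoFormatOption:
    pipeline_cls: Type[DemoPipeline]
    pipeline_options: Optional[dict] = None

    def __post_init__(self) -> None:
        # Fill the optional field from the pipeline class when it is unset.
        if self.pipeline_options is None:
            self.pipeline_options = self.pipeline_cls.get_default_options()


implicit = DemoFormatOption(pipeline_cls=DemoPipeline)
explicit = DemoFormatOption(
    pipeline_cls=DemoPipeline,
    pipeline_options={"images_scale": 2.0},
)
```

The benefit of this pattern is that downstream code can rely on `pipeline_options` being populated without every caller having to pass it.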
InputFormat
Bases: str, Enum
A document format supported by document backend parsers.
Attributes:
- ASCIIDOC
- AUDIO
- CSV
- DOCX
- HTML
- IMAGE
- JSON_DOCLING
- MD
- METS_GBS
- PDF
- PPTX
- VTT
- XLSX
- XML_JATS
- XML_USPTO
ASCIIDOC
class-attribute
instance-attribute
ASCIIDOC = 'asciidoc'
AUDIO
class-attribute
instance-attribute
AUDIO = 'audio'
CSV
class-attribute
instance-attribute
CSV = 'csv'
DOCX
class-attribute
instance-attribute
DOCX = 'docx'
HTML
class-attribute
instance-attribute
HTML = 'html'
IMAGE
class-attribute
instance-attribute
IMAGE = 'image'
JSON_DOCLING
class-attribute
instance-attribute
JSON_DOCLING = 'json_docling'
MD
class-attribute
instance-attribute
MD = 'md'
METS_GBS
class-attribute
instance-attribute
METS_GBS = 'mets_gbs'
PDF
class-attribute
instance-attribute
PDF = 'pdf'
PPTX
class-attribute
instance-attribute
PPTX = 'pptx'
VTT
class-attribute
instance-attribute
VTT = 'vtt'
XLSX
class-attribute
instance-attribute
XLSX = 'xlsx'
XML_JATS
class-attribute
instance-attribute
XML_JATS = 'xml_jats'
XML_USPTO
class-attribute
instance-attribute
XML_USPTO = 'xml_uspto'
PdfFormatOption
Bases: FormatOption
Methods:
- set_optional_field_default
Attributes:
- backend (Type[AbstractDocumentBackend])
- backend_options (Optional[PdfBackendOptions])
- model_config
- pipeline_cls (Type)
- pipeline_options (Optional[PipelineOptions])
backend
class-attribute
instance-attribute
backend: Type[AbstractDocumentBackend] = DoclingParseV4DocumentBackend
backend_options
class-attribute
instance-attribute
backend_options: Optional[PdfBackendOptions] = None
model_config
class-attribute
instance-attribute
model_config = ConfigDict(arbitrary_types_allowed=True)
pipeline_options
class-attribute
instance-attribute
pipeline_options: Optional[PipelineOptions] = None
set_optional_field_default
set_optional_field_default() -> Self
ImageFormatOption
Bases: FormatOption
Methods:
- set_optional_field_default
Attributes:
- backend (Type[AbstractDocumentBackend])
- backend_options (Optional[BackendOptions])
- model_config
- pipeline_cls (Type)
- pipeline_options (Optional[PipelineOptions])
backend
class-attribute
instance-attribute
backend: Type[AbstractDocumentBackend] = DoclingParseV4DocumentBackend
backend_options
class-attribute
instance-attribute
backend_options: Optional[BackendOptions] = None
model_config
class-attribute
instance-attribute
model_config = ConfigDict(arbitrary_types_allowed=True)
pipeline_options
class-attribute
instance-attribute
pipeline_options: Optional[PipelineOptions] = None
set_optional_field_default
set_optional_field_default() -> Self
StandardPdfPipeline
StandardPdfPipeline(pipeline_options: PdfPipelineOptions)
Bases: PaginatedPipeline
Methods:
- download_models_hf
- execute
- get_default_options
- get_ocr_model
- initialize_page
- is_backend_supported
Attributes:
- artifacts_path (Optional[Path])
- build_pipe
- enrichment_pipe
- keep_backend
- keep_images
- pipeline_options (PdfPipelineOptions)
- reading_order_model
artifacts_path
instance-attribute
artifacts_path: Optional[Path] = None
build_pipe
instance-attribute
build_pipe = [PagePreprocessingModel(options=PagePreprocessingOptions(images_scale=images_scale)), ocr_model, LayoutModel(artifacts_path=artifacts_path, accelerator_options=accelerator_options, options=layout_options), TableStructureModel(enabled=do_table_structure, artifacts_path=artifacts_path, options=table_structure_options, accelerator_options=accelerator_options), PageAssembleModel(options=PageAssembleOptions())]
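`build_pipe` is an ordered list of model stages. One common way to run such a list is to feed each stage's output into the next; the sketch below shows that composition with toy stages. How Docling actually invokes its models is not shown in this reference, so the call signature here is an assumption, and `scale`, `mark_layout`, and `run_pipe` are hypothetical.

```python
from typing import Callable, Iterable, Iterator

Page = dict  # toy page representation
Stage = Callable[[Iterable[Page]], Iterator[Page]]


def scale(pages: Iterable[Page]) -> Iterator[Page]:
    for page in pages:
        yield {**page, "scaled": True}


def mark_layout(pages: Iterable[Page]) -> Iterator[Page]:
    for page in pages:
        yield {**page, "layout": "done"}


def run_pipe(pipe: list[Stage], pages: Iterable[Page]) -> list[Page]:
    stream: Iterable[Page] = pages
    for stage in pipe:
        stream = stage(stream)  # chain lazily; nothing runs yet
    return list(stream)         # pulling the result drives every stage


out = run_pipe([scale, mark_layout], [{"no": 1}, {"no": 2}])
```

Order matters in such a chain: a later stage can only see what earlier stages have already attached to the page.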
enrichment_pipe
instance-attribute
enrichment_pipe = [CodeFormulaModel(enabled=do_code_enrichment or do_formula_enrichment, artifacts_path=artifacts_path, options=CodeFormulaModelOptions(do_code_enrichment=do_code_enrichment, do_formula_enrichment=do_formula_enrichment), accelerator_options=accelerator_options), *(enrichment_pipe)]
keep_backend
instance-attribute
keep_backend = True
keep_images
instance-attribute
keep_images = generate_page_images or generate_picture_images or generate_table_images
reading_order_model
instance-attribute
reading_order_model = ReadingOrderModel(options=ReadingOrderOptions())
download_models_hf
staticmethod
download_models_hf(local_dir: Optional[Path] = None, force: bool = False) -> Path
get_ocr_model
get_ocr_model(artifacts_path: Optional[Path] = None) -> BaseOcrModel
is_backend_supported
classmethod
is_backend_supported(backend: AbstractDocumentBackend)
WordFormatOption
Bases: FormatOption
Methods:
- set_optional_field_default
Attributes:
- backend (Type[AbstractDocumentBackend])
- backend_options (Optional[BackendOptions])
- model_config
- pipeline_cls (Type)
- pipeline_options (Optional[PipelineOptions])
backend
class-attribute
instance-attribute
backend: Type[AbstractDocumentBackend] = MsWordDocumentBackend
backend_options
class-attribute
instance-attribute
backend_options: Optional[BackendOptions] = None
model_config
class-attribute
instance-attribute
model_config = ConfigDict(arbitrary_types_allowed=True)
pipeline_options
class-attribute
instance-attribute
pipeline_options: Optional[PipelineOptions] = None
set_optional_field_default
set_optional_field_default() -> Self
PowerpointFormatOption
Bases: FormatOption
Methods:
- set_optional_field_default
Attributes:
- backend (Type[AbstractDocumentBackend])
- backend_options (Optional[BackendOptions])
- model_config
- pipeline_cls (Type)
- pipeline_options (Optional[PipelineOptions])
backend
class-attribute
instance-attribute
backend: Type[AbstractDocumentBackend] = MsPowerpointDocumentBackend
backend_options
class-attribute
instance-attribute
backend_options: Optional[BackendOptions] = None
model_config
class-attribute
instance-attribute
model_config = ConfigDict(arbitrary_types_allowed=True)
pipeline_options
class-attribute
instance-attribute
pipeline_options: Optional[PipelineOptions] = None
set_optional_field_default
set_optional_field_default() -> Self
MarkdownFormatOption
Bases: FormatOption
Methods:
- set_optional_field_default
Attributes:
- backend (Type[AbstractDocumentBackend])
- backend_options (Optional[MarkdownBackendOptions])
- model_config
- pipeline_cls (Type)
- pipeline_options (Optional[PipelineOptions])
backend
class-attribute
instance-attribute
backend: Type[AbstractDocumentBackend] = MarkdownDocumentBackend
backend_options
class-attribute
instance-attribute
backend_options: Optional[MarkdownBackendOptions] = None
model_config
class-attribute
instance-attribute
model_config = ConfigDict(arbitrary_types_allowed=True)
pipeline_options
class-attribute
instance-attribute
pipeline_options: Optional[PipelineOptions] = None
set_optional_field_default
set_optional_field_default() -> Self
AsciiDocFormatOption
Bases: FormatOption
Methods:
- set_optional_field_default
Attributes:
- backend (Type[AbstractDocumentBackend])
- backend_options (Optional[BackendOptions])
- model_config
- pipeline_cls (Type)
- pipeline_options (Optional[PipelineOptions])
backend
class-attribute
instance-attribute
backend: Type[AbstractDocumentBackend] = AsciiDocBackend
backend_options
class-attribute
instance-attribute
backend_options: Optional[BackendOptions] = None
model_config
class-attribute
instance-attribute
model_config = ConfigDict(arbitrary_types_allowed=True)
pipeline_options
class-attribute
instance-attribute
pipeline_options: Optional[PipelineOptions] = None
set_optional_field_default
set_optional_field_default() -> Self
HTMLFormatOption
Bases: FormatOption
Methods:
- set_optional_field_default
Attributes:
- backend (Type[AbstractDocumentBackend])
- backend_options (Optional[HTMLBackendOptions])
- model_config
- pipeline_cls (Type)
- pipeline_options (Optional[PipelineOptions])
backend
class-attribute
instance-attribute
backend: Type[AbstractDocumentBackend] = HTMLDocumentBackend
backend_options
class-attribute
instance-attribute
backend_options: Optional[HTMLBackendOptions] = None
model_config
class-attribute
instance-attribute
model_config = ConfigDict(arbitrary_types_allowed=True)
pipeline_options
class-attribute
instance-attribute
pipeline_options: Optional[PipelineOptions] = None
set_optional_field_default
set_optional_field_default() -> Self
SimplePipeline
SimplePipeline(pipeline_options: ConvertPipelineOptions)
Bases: ConvertPipeline
SimpleModelPipeline.
This class is currently used for formats and backends that produce DoclingDocument output directly.
Methods:
- is_backend_supported
Attributes:
- artifacts_path (Optional[Path])
- build_pipe (List[Callable])
- enrichment_pipe
- keep_images
- pipeline_options (ConvertPipelineOptions)
artifacts_path
instance-attribute
artifacts_path: Optional[Path] = None
build_pipe
instance-attribute
build_pipe: List[Callable] = []
enrichment_pipe
instance-attribute
enrichment_pipe = [DocumentPictureClassifier(enabled=do_picture_classification, artifacts_path=artifacts_path, options=DocumentPictureClassifierOptions(), accelerator_options=accelerator_options), picture_description_model]
keep_images
instance-attribute
keep_images = False
is_backend_supported
classmethod
is_backend_supported(backend: AbstractDocumentBackend)