Document converter
This is an automatic generated API reference of the main components of Docling.
document_converter
Classes:
-
DocumentConverter
– -
ConversionResult
– -
ConversionStatus
– -
FormatOption
– -
InputFormat
–A document format supported by document backend parsers.
-
PdfFormatOption
– -
ImageFormatOption
– -
StandardPdfPipeline
– -
WordFormatOption
– -
PowerpointFormatOption
– -
MarkdownFormatOption
– -
AsciiDocFormatOption
– -
HTMLFormatOption
– -
SimplePipeline
–SimpleModelPipeline.
DocumentConverter
DocumentConverter(allowed_formats: Optional[List[InputFormat]] = None, format_options: Optional[Dict[InputFormat, FormatOption]] = None)
Methods:
-
convert
– -
convert_all
– -
initialize_pipeline
–Initialize the conversion pipeline for the selected format.
Attributes:
-
allowed_formats
– -
format_to_options
– -
initialized_pipelines
(Dict[Tuple[Type[BasePipeline], str], BasePipeline]
) –
allowed_formats
instance-attribute
allowed_formats = allowed_formats if allowed_formats is not None else list(InputFormat)
format_to_options
instance-attribute
format_to_options = {format: _get_default_option(format=format) if (custom_option := get(format)) is None else _lG7dekt9j1xmfor format in allowed_formats}
initialized_pipelines
instance-attribute
initialized_pipelines: Dict[Tuple[Type[BasePipeline], str], BasePipeline] = {}
convert
convert(source: Union[Path, str, DocumentStream], headers: Optional[Dict[str, str]] = None, raises_on_error: bool = True, max_num_pages: int = maxsize, max_file_size: int = maxsize, page_range: PageRange = DEFAULT_PAGE_RANGE) -> ConversionResult
convert_all
convert_all(source: Iterable[Union[Path, str, DocumentStream]], headers: Optional[Dict[str, str]] = None, raises_on_error: bool = True, max_num_pages: int = maxsize, max_file_size: int = maxsize, page_range: PageRange = DEFAULT_PAGE_RANGE) -> Iterator[ConversionResult]
initialize_pipeline
initialize_pipeline(format: InputFormat)
Initialize the conversion pipeline for the selected format.
ConversionResult
Bases: BaseModel
Attributes:
-
assembled
(AssembledUnit
) – -
confidence
(ConfidenceReport
) – -
document
(DoclingDocument
) – -
errors
(List[ErrorItem]
) – -
input
(InputDocument
) – -
legacy_document
– -
pages
(List[Page]
) – -
status
(ConversionStatus
) – -
timings
(Dict[str, ProfilingItem]
) –
assembled
class-attribute
instance-attribute
assembled: AssembledUnit = AssembledUnit()
confidence
class-attribute
instance-attribute
confidence: ConfidenceReport = Field(default_factory=ConfidenceReport)
errors
class-attribute
instance-attribute
errors: List[ErrorItem] = []
input
instance-attribute
input: InputDocument
legacy_document
property
legacy_document
pages
class-attribute
instance-attribute
pages: List[Page] = []
timings
class-attribute
instance-attribute
timings: Dict[str, ProfilingItem] = {}
ConversionStatus
Bases: str
, Enum
Attributes:
FAILURE
class-attribute
instance-attribute
FAILURE = 'failure'
PARTIAL_SUCCESS
class-attribute
instance-attribute
PARTIAL_SUCCESS = 'partial_success'
PENDING
class-attribute
instance-attribute
PENDING = 'pending'
SKIPPED
class-attribute
instance-attribute
SKIPPED = 'skipped'
STARTED
class-attribute
instance-attribute
STARTED = 'started'
SUCCESS
class-attribute
instance-attribute
SUCCESS = 'success'
FormatOption
Bases: BaseModel
Methods:
Attributes:
-
backend
(Type[AbstractDocumentBackend]
) – -
model_config
– -
pipeline_cls
(Type[BasePipeline]
) – -
pipeline_options
(Optional[PipelineOptions]
) –
backend
instance-attribute
backend: Type[AbstractDocumentBackend]
model_config
class-attribute
instance-attribute
model_config = ConfigDict(arbitrary_types_allowed=True)
pipeline_cls
instance-attribute
pipeline_cls: Type[BasePipeline]
pipeline_options
class-attribute
instance-attribute
pipeline_options: Optional[PipelineOptions] = None
InputFormat
Bases: str
, Enum
A document format supported by document backend parsers.
Attributes:
-
ASCIIDOC
– -
CSV
– -
DOCX
– -
HTML
– -
IMAGE
– -
JSON_DOCLING
– -
MD
– -
PDF
– -
PPTX
– -
XLSX
– -
XML_JATS
– -
XML_USPTO
–
ASCIIDOC
class-attribute
instance-attribute
ASCIIDOC = 'asciidoc'
CSV
class-attribute
instance-attribute
CSV = 'csv'
DOCX
class-attribute
instance-attribute
DOCX = 'docx'
HTML
class-attribute
instance-attribute
HTML = 'html'
IMAGE
class-attribute
instance-attribute
IMAGE = 'image'
JSON_DOCLING
class-attribute
instance-attribute
JSON_DOCLING = 'json_docling'
MD
class-attribute
instance-attribute
MD = 'md'
PDF
class-attribute
instance-attribute
PDF = 'pdf'
PPTX
class-attribute
instance-attribute
PPTX = 'pptx'
XLSX
class-attribute
instance-attribute
XLSX = 'xlsx'
XML_JATS
class-attribute
instance-attribute
XML_JATS = 'xml_jats'
XML_USPTO
class-attribute
instance-attribute
XML_USPTO = 'xml_uspto'
PdfFormatOption
Bases: FormatOption
Methods:
Attributes:
-
backend
(Type[AbstractDocumentBackend]
) – -
model_config
– -
pipeline_cls
(Type
) – -
pipeline_options
(Optional[PipelineOptions]
) –
backend
class-attribute
instance-attribute
backend: Type[AbstractDocumentBackend] = DoclingParseV4DocumentBackend
model_config
class-attribute
instance-attribute
model_config = ConfigDict(arbitrary_types_allowed=True)
pipeline_options
class-attribute
instance-attribute
pipeline_options: Optional[PipelineOptions] = None
ImageFormatOption
Bases: FormatOption
Methods:
Attributes:
-
backend
(Type[AbstractDocumentBackend]
) – -
model_config
– -
pipeline_cls
(Type
) – -
pipeline_options
(Optional[PipelineOptions]
) –
backend
class-attribute
instance-attribute
backend: Type[AbstractDocumentBackend] = DoclingParseV4DocumentBackend
model_config
class-attribute
instance-attribute
model_config = ConfigDict(arbitrary_types_allowed=True)
pipeline_options
class-attribute
instance-attribute
pipeline_options: Optional[PipelineOptions] = None
StandardPdfPipeline
StandardPdfPipeline(pipeline_options: PdfPipelineOptions)
Bases: PaginatedPipeline
Methods:
-
download_models_hf
– -
execute
– -
get_default_options
– -
get_ocr_model
– -
get_picture_description_model
– -
initialize_page
– -
is_backend_supported
–
Attributes:
-
build_pipe
– -
enrichment_pipe
– -
keep_backend
– -
keep_images
– -
pipeline_options
(PdfPipelineOptions
) – -
reading_order_model
–
build_pipe
instance-attribute
build_pipe = [PagePreprocessingModel(options=PagePreprocessingOptions(images_scale=images_scale, create_parsed_page=generate_parsed_pages)), ocr_model, LayoutModel(artifacts_path=artifacts_path, accelerator_options=accelerator_options), TableStructureModel(enabled=do_table_structure, artifacts_path=artifacts_path, options=table_structure_options, accelerator_options=accelerator_options), PageAssembleModel(options=PageAssembleOptions())]
enrichment_pipe
instance-attribute
enrichment_pipe = [CodeFormulaModel(enabled=do_code_enrichment or do_formula_enrichment, artifacts_path=artifacts_path, options=CodeFormulaModelOptions(do_code_enrichment=do_code_enrichment, do_formula_enrichment=do_formula_enrichment), accelerator_options=accelerator_options), DocumentPictureClassifier(enabled=do_picture_classification, artifacts_path=artifacts_path, options=DocumentPictureClassifierOptions(), accelerator_options=accelerator_options), picture_description_model]
keep_backend
instance-attribute
keep_backend = True
keep_images
instance-attribute
keep_images = generate_page_images or generate_picture_images or generate_table_images
reading_order_model
instance-attribute
reading_order_model = ReadingOrderModel(options=ReadingOrderOptions())
download_models_hf
staticmethod
download_models_hf(local_dir: Optional[Path] = None, force: bool = False) -> Path
get_ocr_model
get_ocr_model(artifacts_path: Optional[Path] = None) -> BaseOcrModel
get_picture_description_model
get_picture_description_model(artifacts_path: Optional[Path] = None) -> Optional[PictureDescriptionBaseModel]
is_backend_supported
classmethod
is_backend_supported(backend: AbstractDocumentBackend)
WordFormatOption
Bases: FormatOption
Methods:
Attributes:
-
backend
(Type[AbstractDocumentBackend]
) – -
model_config
– -
pipeline_cls
(Type
) – -
pipeline_options
(Optional[PipelineOptions]
) –
backend
class-attribute
instance-attribute
backend: Type[AbstractDocumentBackend] = MsWordDocumentBackend
model_config
class-attribute
instance-attribute
model_config = ConfigDict(arbitrary_types_allowed=True)
pipeline_options
class-attribute
instance-attribute
pipeline_options: Optional[PipelineOptions] = None
PowerpointFormatOption
Bases: FormatOption
Methods:
Attributes:
-
backend
(Type[AbstractDocumentBackend]
) – -
model_config
– -
pipeline_cls
(Type
) – -
pipeline_options
(Optional[PipelineOptions]
) –
backend
class-attribute
instance-attribute
backend: Type[AbstractDocumentBackend] = MsPowerpointDocumentBackend
model_config
class-attribute
instance-attribute
model_config = ConfigDict(arbitrary_types_allowed=True)
pipeline_options
class-attribute
instance-attribute
pipeline_options: Optional[PipelineOptions] = None
MarkdownFormatOption
Bases: FormatOption
Methods:
Attributes:
-
backend
(Type[AbstractDocumentBackend]
) – -
model_config
– -
pipeline_cls
(Type
) – -
pipeline_options
(Optional[PipelineOptions]
) –
backend
class-attribute
instance-attribute
backend: Type[AbstractDocumentBackend] = MarkdownDocumentBackend
model_config
class-attribute
instance-attribute
model_config = ConfigDict(arbitrary_types_allowed=True)
pipeline_options
class-attribute
instance-attribute
pipeline_options: Optional[PipelineOptions] = None
AsciiDocFormatOption
Bases: FormatOption
Methods:
Attributes:
-
backend
(Type[AbstractDocumentBackend]
) – -
model_config
– -
pipeline_cls
(Type
) – -
pipeline_options
(Optional[PipelineOptions]
) –
backend
class-attribute
instance-attribute
backend: Type[AbstractDocumentBackend] = AsciiDocBackend
model_config
class-attribute
instance-attribute
model_config = ConfigDict(arbitrary_types_allowed=True)
pipeline_options
class-attribute
instance-attribute
pipeline_options: Optional[PipelineOptions] = None
HTMLFormatOption
Bases: FormatOption
Methods:
Attributes:
-
backend
(Type[AbstractDocumentBackend]
) – -
model_config
– -
pipeline_cls
(Type
) – -
pipeline_options
(Optional[PipelineOptions]
) –
backend
class-attribute
instance-attribute
backend: Type[AbstractDocumentBackend] = HTMLDocumentBackend
model_config
class-attribute
instance-attribute
model_config = ConfigDict(arbitrary_types_allowed=True)
pipeline_options
class-attribute
instance-attribute
pipeline_options: Optional[PipelineOptions] = None
SimplePipeline
SimplePipeline(pipeline_options: PipelineOptions)
Bases: BasePipeline
SimpleModelPipeline.
This class is used at the moment for formats / backends which produce straight DoclingDocument output.
Methods:
Attributes:
-
build_pipe
(List[Callable]
) – -
enrichment_pipe
(List[GenericEnrichmentModel[Any]]
) – -
keep_images
– -
pipeline_options
–
build_pipe
instance-attribute
build_pipe: List[Callable] = []
enrichment_pipe
instance-attribute
enrichment_pipe: List[GenericEnrichmentModel[Any]] = []
keep_images
instance-attribute
keep_images = False
pipeline_options
instance-attribute
pipeline_options = pipeline_options
is_backend_supported
classmethod
is_backend_supported(backend: AbstractDocumentBackend)