Skip to content

Vlm pipeline api model

Use the VLM pipeline with remote API models (LM Studio, Ollama, VLLM, watsonx.ai).

What this example does - Demonstrates using presets with API runtimes (LM Studio, Ollama, VLLM, watsonx.ai) - Shows that API is just a runtime choice, not a different options class - Explains pre-configured API types and custom API configuration

Prerequisites - Install Docling with VLM extras and python-dotenv if using environment files. - For local APIs: run LM Studio, Ollama, or VLLM locally. - For cloud APIs: set required environment variables (see watsonx.ai example).

How to run - From the repo root: python docs/examples/vlm_pipeline_api_model.py. - Each example checks its own prerequisites and skips if not available.

Notes - The NEW runtime system unifies API and local inference - For legacy approach, see legacy examples in docs/examples/legacy/

import logging
import os
from pathlib import Path

import requests
from dotenv import load_dotenv

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    VlmConvertOptions,
    VlmPipelineOptions,
)
from docling.datamodel.vlm_engine_options import (
    ApiVlmEngineOptions,
    VlmEngineType,
)
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline


def check_and_load_lmstudio_model(model_name: str) -> bool:
    """Check if model is loaded in LM Studio and attempt to load if not.

    Args:
        model_name: The model name to check/load

    Returns:
        True if model is loaded or successfully loaded, False otherwise
    """
    try:
        # Check if model is already loaded
        response = requests.get("http://localhost:1234/v1/models", timeout=2)
        if response.status_code == 200:
            models = response.json().get("data", [])
            loaded_models = [m.get("id") for m in models]
            if model_name in loaded_models:
                print(f"✓ Model '{model_name}' is already loaded in LM Studio")
                return True

            # Try to load the model using LM Studio API
            print(f"Attempting to load model '{model_name}' in LM Studio...")

            load_response = requests.post(
                "http://localhost:1234/api/v1/models/load",
                headers={"Content-Type": "application/json"},
                json={
                    "model": model_name,
                },
                timeout=60,
            )

            if load_response.status_code == 200:
                print(f"✓ Successfully loaded model '{model_name}'")
                return True
            else:
                print(f"✗ Failed to load model: HTTP {load_response.status_code}")
                print("  Please load the model manually in LM Studio:")
                print(f"    lms load {model_name}")
                return False
        return False
    except requests.exceptions.Timeout:
        print("✗ Timeout while trying to load model")
        return False
    except Exception as e:
        print(f"✗ Error checking/loading model: {e}")
        return False


def check_and_pull_ollama_model(model_name: str) -> bool:
    """Check if model exists in Ollama and attempt to pull if not.

    Args:
        model_name: The model name to check/pull

    Returns:
        True if model exists or successfully pulled, False otherwise
    """
    try:
        # Check if model exists
        response = requests.get("http://localhost:11434/api/tags", timeout=2)
        if response.status_code == 200:
            models = response.json().get("models", [])
            model_names = [m.get("name") for m in models]
            # Check for exact match or with :latest tag
            if model_name in model_names or f"{model_name}:latest" in model_names:
                print(f"✓ Model '{model_name}' is already available in Ollama")
                return True

            # Try to pull the model using Ollama API
            print(f"Attempting to pull model '{model_name}' in Ollama...")
            print("This may take a few minutes...")

            # Ollama pull API endpoint
            pull_response = requests.post(
                "http://localhost:11434/api/pull",
                json={"name": model_name},
                stream=True,
                timeout=300,
            )

            if pull_response.status_code == 200:
                # Stream the response to show progress
                for line in pull_response.iter_lines():
                    if line:
                        import json

                        try:
                            data = json.loads(line)
                            status = data.get("status", "")
                            if status:
                                print(f"  {status}", end="\r")
                        except json.JSONDecodeError:
                            pass
                print()  # New line after progress
                print(f"✓ Successfully pulled model '{model_name}'")
                return True
            else:
                print(f"✗ Failed to pull model: HTTP {pull_response.status_code}")
                return False
        return False
    except requests.exceptions.Timeout:
        print("✗ Timeout while trying to pull model (this can take a while)")
        print("Please try pulling manually: ollama pull", model_name)
        return False
    except Exception as e:
        print(f"✗ Error checking/pulling model: {e}")
        return False


def run_lmstudio_example(input_doc_path: Path) -> bool:
    """Example 1: Using Granite-Docling preset with LM Studio API runtime.

    Returns:
        True if example ran successfully, False if skipped
    """
    print("=" * 70)
    print("Example 1: Granite-Docling with LM Studio (pre-configured API type)")
    print("=" * 70)
    print("\nPrerequisites:")
    print("- Start LM Studio: lms server start")
    print("- Model will be loaded automatically if not already loaded")
    print("  (or manually: lms load granite-docling-258m-mlx)")
    print()

    # Check if LM Studio is running
    try:
        response = requests.get("http://localhost:1234/v1/models", timeout=2)
        if response.status_code != 200:
            print("WARNING: LM Studio server not responding correctly")
            print("Skipping LM Studio example.\n")
            return False
    except requests.exceptions.RequestException:
        print("WARNING: LM Studio server not running at http://localhost:1234")
        print("Skipping LM Studio example.\n")
        return False

    # Check and load the model
    # Note: LM Studio uses a different model ID than the HuggingFace repo
    model_name = "granite-docling-258m-mlx"
    if not check_and_load_lmstudio_model(model_name):
        print("Skipping LM Studio example.\n")
        return False

    # Use granite_docling preset with LM Studio API runtime
    # The preset is pre-configured for LM Studio API type
    vlm_options = VlmConvertOptions.from_preset(
        "granite_docling",
        engine_options=ApiVlmEngineOptions(
            runtime_type=VlmEngineType.API_LMSTUDIO,
            # url is pre-configured for LM Studio (http://localhost:1234/v1/chat/completions)
            # model name is pre-configured from the preset
            timeout=90,
        ),
    )

    pipeline_options = VlmPipelineOptions(
        vlm_options=vlm_options,
        enable_remote_services=True,  # Required for API runtimes
    )

    print("\nOther API types are also pre-configured:")
    print("- VlmEngineType.API_OLLAMA: http://localhost:11434/v1/chat/completions")
    print("- VlmEngineType.API_OPENAI: https://api.openai.com/v1/chat/completions")
    print("- VlmEngineType.API: Generic API endpoint (you specify the URL)")
    print("\nEach preset has pre-configured model names for these API types.\n")

    doc_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
                pipeline_cls=VlmPipeline,
            )
        }
    )

    result = doc_converter.convert(input_doc_path)
    print(result.document.export_to_markdown())
    return True


def run_ollama_example(input_doc_path: Path) -> bool:
    """Example 2: Using Granite-Docling preset with Ollama.

    Returns:
        True if example ran successfully, False if skipped
    """
    print("\n" + "=" * 70)
    print("Example 2: Granite-Docling with Ollama (pre-configured API type)")
    print("=" * 70)
    print("\nPrerequisites:")
    print("- Install Ollama: https://ollama.ai")
    print("- Pull model: ollama pull ibm/granite-docling:258m")
    print()

    # Check if Ollama is running
    try:
        response = requests.get("http://localhost:11434/api/tags", timeout=2)
        if response.status_code != 200:
            print("WARNING: Ollama server not responding correctly")
            print("Skipping Ollama example.\n")
            return False
    except requests.exceptions.RequestException:
        print("WARNING: Ollama server not running at http://localhost:11434")
        print("Skipping Ollama example.\n")
        return False

    # Check and pull the model
    model_name = "ibm/granite-docling:258m"
    if not check_and_pull_ollama_model(model_name):
        print("Skipping Ollama example.\n")
        return False

    # Use granite_docling preset with Ollama API runtime
    vlm_options = VlmConvertOptions.from_preset(
        "granite_docling",
        engine_options=ApiVlmEngineOptions(
            runtime_type=VlmEngineType.API_OLLAMA,
            # url is pre-configured for Ollama (http://localhost:11434/v1/chat/completions)
            # model name is pre-configured from the preset
            timeout=90,
        ),
    )

    pipeline_options = VlmPipelineOptions(
        vlm_options=vlm_options,
        enable_remote_services=True,
    )

    doc_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
                pipeline_cls=VlmPipeline,
            )
        }
    )

    result = doc_converter.convert(input_doc_path)
    print(result.document.export_to_markdown())
    return True


def run_vllm_example(input_doc_path: Path) -> bool:
    """Example 3: Using Granite-Docling preset with VLLM server.

    Returns:
        True if example ran successfully, False if skipped
    """
    print("\n" + "=" * 70)
    print("Example 3: Granite-Docling with VLLM (generic API configuration)")
    print("=" * 70)
    print("\nPrerequisites:")
    print("- Start VLLM server:")
    print("  vllm serve ibm-granite/granite-docling-258M --revision untied")
    print()

    # Check if VLLM is running
    try:
        response = requests.get("http://localhost:8000/v1/models", timeout=2)
        if response.status_code != 200:
            print("WARNING: VLLM server not responding correctly")
            print("Skipping VLLM example.\n")
            return False
    except requests.exceptions.RequestException:
        print("WARNING: VLLM server not running at http://localhost:8000")
        print("Skipping VLLM example.\n")
        return False

    # Use granite_docling preset with generic API runtime
    # For VLLM, we need to provide custom URL and params
    vlm_options = VlmConvertOptions.from_preset(
        "granite_docling",
        engine_options=ApiVlmEngineOptions(
            runtime_type=VlmEngineType.API,  # Generic API type
            url="http://localhost:8000/v1/chat/completions",
            params={
                "model": "ibm-granite/granite-docling-258M",
                "temperature": 0.0,
                "max_tokens": 8192,
                "skip_special_tokens": False,
            },
            timeout=90,
        ),
    )

    pipeline_options = VlmPipelineOptions(
        vlm_options=vlm_options,
        enable_remote_services=True,
    )

    doc_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
                pipeline_cls=VlmPipeline,
            )
        }
    )

    result = doc_converter.convert(input_doc_path)
    print(result.document.export_to_markdown())
    return True


def run_watsonx_example(input_doc_path: Path) -> bool:
    """Example 4: Using preset with watsonx.ai (custom API configuration).

    Returns:
        True if example ran successfully, False if skipped
    """
    print("\n" + "=" * 70)
    print("Example 4: Granite-Docling with watsonx.ai (custom API configuration)")
    print("=" * 70)

    # Check if running in CI environment
    if os.environ.get("CI"):
        print("Skipping watsonx.ai example in CI environment")
        return False

    # Load environment variables
    load_dotenv()
    api_key = os.environ.get("WX_API_KEY")
    project_id = os.environ.get("WX_PROJECT_ID")

    # Check if credentials are available
    if not api_key or not project_id:
        print("WARNING: watsonx.ai credentials not found.")
        print(
            "Set WX_API_KEY and WX_PROJECT_ID environment variables to run this example."
        )
        print("Skipping watsonx.ai example.\n")
        return False

    def _get_iam_access_token(api_key: str) -> str:
        res = requests.post(
            url="https://iam.cloud.ibm.com/identity/token",
            headers={"Content-Type": "application/x-www-form-urlencoded"},
            data=f"grant_type=urn:ibm:params:oauth:grant-type:apikey&apikey={api_key}",
        )
        res.raise_for_status()
        return res.json()["access_token"]

    print("\nNote: Granite-Docling models are not currently available on watsonx.ai")
    print("Using Llama 3.2 Vision model instead")
    print("The preset still provides the prompt and response format configuration\n")

    # Use granite_docling preset but override the model for watsonx.ai
    vlm_options = VlmConvertOptions.from_preset(
        "granite_docling",
        engine_options=ApiVlmEngineOptions(
            runtime_type=VlmEngineType.API,  # Generic API type
            url="https://us-south.ml.cloud.ibm.com/ml/v1/text/chat?version=2023-05-29",
            headers={
                "Authorization": "Bearer " + _get_iam_access_token(api_key=api_key),
            },
            params={
                "model_id": "meta-llama/llama-3-2-11b-vision-instruct",
                "project_id": project_id,
                "parameters": {"max_new_tokens": 4096},
            },
            timeout=60,
        ),
    )

    pipeline_options = VlmPipelineOptions(
        vlm_options=vlm_options,
        enable_remote_services=True,
    )

    doc_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
                pipeline_cls=VlmPipeline,
            )
        }
    )

    result = doc_converter.convert(input_doc_path)
    print(result.document.export_to_markdown())
    return True


def main():
    logging.basicConfig(level=logging.INFO)

    data_folder = Path(__file__).parent / "../../tests/data"
    input_doc_path = data_folder / "pdf/2305.03393v1-pg9.pdf"

    # Track which examples ran
    results = {
        "LM Studio": run_lmstudio_example(input_doc_path),
        "Ollama": run_ollama_example(input_doc_path),
        "VLLM": run_vllm_example(input_doc_path),
        "watsonx.ai": run_watsonx_example(input_doc_path),
    }

    # Print summary
    print("\n" + "=" * 70)
    print("SUMMARY")
    print("=" * 70)

    ran = [name for name, success in results.items() if success]
    skipped = [name for name, success in results.items() if not success]

    if ran:
        print(f"\n✓ Examples that ran successfully ({len(ran)}):")
        for name in ran:
            print(f"  - {name}")

    if skipped:
        print(f"\n⊘ Examples that were skipped ({len(skipped)}):")
        for name in skipped:
            reason = "Server not running"
            if name == "watsonx.ai":
                if os.environ.get("CI"):
                    reason = "Running in CI environment"
                else:
                    reason = "Credentials not found (WX_API_KEY, WX_PROJECT_ID)"
            print(f"  - {name}: {reason}")

    print()


if __name__ == "__main__":
    main()

Key Concepts

Pre-configured API Types

The new runtime system has pre-configured API types: - API_OLLAMA: Ollama server (port 11434) - API_LMSTUDIO: LM Studio server (port 1234) - API_OPENAI: OpenAI API - API: Generic API endpoint (you provide URL)

Each preset knows the appropriate model names for these API types.

Custom API Configuration

For services like watsonx.ai that need custom configuration: - Use VlmEngineType.API (generic) - Provide custom url, headers, and params - The preset still provides the base model configuration (prompt, response format)

Same Preset, Different Runtime

You can use the same preset (e.g., "granite_docling") with: - Local Transformers runtime (see other examples) - Local MLX runtime (macOS) - LM Studio API runtime (this example) - Ollama API runtime (this example) - VLLM API runtime (this example) - watsonx.ai API runtime (this example) - Any other API endpoint

This makes it easy to develop locally and deploy to production!

Available Presets for VlmConvert

  • granite_docling: IBM Granite Docling 258M (DocTags format)
  • smoldocling: SmolDocling 256M (DocTags format)
  • deepseek_ocr: DeepSeek OCR (Markdown format)
  • granite_vision: IBM Granite Vision (Markdown format)
  • pixtral: Pixtral (Markdown format)
  • got_ocr: GOT-OCR (Markdown format)
  • phi4: Phi-4 (Markdown format)
  • qwen: Qwen (Markdown format)
  • nanonets_ocr2: Nanonets OCR2 (Markdown format)
  • gemma_12b: Gemma 12B (Markdown format)
  • gemma_27b: Gemma 27B (Markdown format)
  • dolphin: Dolphin (Markdown format)