IronaAI supports multimodal inputs, allowing you to include images and PDFs in the messages sent to the language model. This enables the model to interpret visual content alongside text, supporting tasks such as image analysis, document understanding, and visual question answering.

When to use multimodal inputs

Use multimodal inputs when you need the language model to process and respond to visual content, such as:

  • Analyzing images
  • Extracting information from PDFs
  • Generating descriptions or summaries of visual content
  • Answering questions about images or documents

Including images in messages

To include an image in your message, use the 'image_url' type within the content list of a message. You can provide either a direct URL to the image or the base64-encoded image embedded in a data URI with the appropriate MIME type (e.g., 'data:image/jpeg;base64,...').

Using an image URL

# `client` is an initialized IronaAI client
response = client.completions.create(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What’s in this image?"},
                {
                    "type": "image_url",
                    "image_url": "https://example.com/image.jpg"
                }
            ]
        }
    ],
    model="openai/gpt-4o"
)
print(response.choices[0].message.content)

Using a base64-encoded image

import base64

with open("image.jpg", "rb") as image_file:
    encoded_image = base64.b64encode(image_file.read()).decode("utf-8")

response = client.completions.create(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What’s in this image?"},
                {
                    "type": "image_url",
                    "image_url": f"data:image/jpeg;base64,{encoded_image}"
                }
            ]
        }
    ],
    model="openai/gpt-4o"
)
print(response.choices[0].message.content)

Including PDFs in messages

Similarly, you can include PDFs using the 'image_url' type, providing either a URL to the PDF or the base64-encoded PDF embedded in a data URI with the MIME type 'application/pdf'.

Using a PDF URL

response = client.completions.create(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What’s in this PDF?"},
                {
                    "type": "image_url",
                    "image_url": "https://example.com/document.pdf"
                }
            ]
        }
    ],
    model="openai/gpt-4o"
)
print(response.choices[0].message.content)

Using a base64-encoded PDF

import base64

with open("document.pdf", "rb") as pdf_file:
    encoded_pdf = base64.b64encode(pdf_file.read()).decode("utf-8")

response = client.completions.create(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What’s in this PDF?"},
                {
                    "type": "image_url",
                    "image_url": f"data:application/pdf;base64,{encoded_pdf}"
                }
            ]
        }
    ],
    model="openai/gpt-4o"
)
print(response.choices[0].message.content)

Model support and selection

Not all language models support multimodal inputs like images and PDFs. IronaAI maintains an internal list of models that can handle these media types (e.g., “openai/gpt-4o”, “anthropic/claude-3-5-haiku-20241022”). When you provide a model_list or rely on the default model list, IronaAI automatically selects a model that supports the media types present in your messages.
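
For illustration, the sketch below passes a candidate model_list and lets IronaAI pick a model that can handle the attached media. Treat the details as assumptions: the candidate list is arbitrary, and it assumes the model_list parameter is accepted by completions.create alongside the message format shown above.

# Hypothetical sketch: let IronaAI select from candidate models.
# Assumes `client` is an initialized IronaAI client and that
# completions.create accepts a model_list as described above.
response = client.completions.create(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {
                    "type": "image_url",
                    "image_url": "https://example.com/image.jpg"
                }
            ]
        }
    ],
    model_list=["openai/gpt-4o", "anthropic/claude-3-5-haiku-20241022"]
)
print(response.choices[0].message.content)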

If you specify a model explicitly, ensure it supports the required media types. You can verify a model’s capabilities using helper functions:

model = "openai/gpt-4o"

print(f"Supports vision: {client.supports_vision(model)}")  # True
print(f"Supports PDFs: {client.supports_pdf_input(model)}")  # True

Note: For OpenAI models, IronaAI includes special handling for PDFs via the openai_completion_with_pdf function. For other models, support depends on their native capabilities as defined in IronaAI’s support_media_dict.
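
If you prefer to manage model selection yourself, one approach is to filter a candidate list with the helper functions above before sending a request. The sketch below is only illustrative: the candidate list is arbitrary, and it relies solely on the supports_vision and supports_pdf_input helpers shown earlier.

# Sketch: pick the first candidate that handles both images and PDFs.
# Assumes `client` exposes supports_vision / supports_pdf_input as above.
candidates = ["openai/gpt-4o", "anthropic/claude-3-5-haiku-20241022"]

capable = [
    m for m in candidates
    if client.supports_vision(m) and client.supports_pdf_input(m)
]

if not capable:
    raise ValueError("No candidate model supports the required media types")

model = capable[0]
print(f"Selected model: {model}")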