Rise of the ChatBots (3) - They can see!

By Charles LAZIOSI
  1. Introduction
  2. ChatGPT Vision - a few lines and they can see

Introduction

GPT-4 with Vision, also known as GPT-4V or gpt-4-vision-preview in API parlance, extends the capabilities of traditional language models by incorporating image processing. This feature moves beyond the text-only limitations of previous models, opening up new possibilities for applications that GPT-4 can be harnessed for.

Developers who already have access to GPT-4 can utilize the gpt-4-vision-preview model through the updated Chat Completions API, which now accepts images as input. However, it's worth noting that the Assistants API doesn't yet accommodate image inputs.

Key points to remember include:

  • The behavior of GPT-4 Turbo with Vision might differ slightly from the non-vision version due to an automatic system message included in conversations.
  • Despite this distinction, GPT-4 Turbo with Vision retains all the capabilities of the standard GPT-4 Turbo preview model while adding vision functionality.
  • The vision feature is just one aspect of the diverse abilities this model possesses.

ChatGPT Vision - a few lines and they can see

The model can process images provided either via a URL or as a base64 encoded string in the user, system, and assistant messages. However, images are not supported in the initial system message, although this may change in the future.
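Both input forms boil down to building the right `image_url` content part. The sketch below shows one way to construct each variant; the helper names (`image_url_part`, `image_b64_part`) are illustrative, not part of the OpenAI SDK.

```python
import base64

def image_url_part(url: str) -> dict:
    # Content part referencing an image by its public URL
    return {"type": "image_url", "image_url": {"url": url}}

def image_b64_part(path: str, mime: str = "image/jpeg") -> dict:
    # Content part embedding a local image as a base64-encoded data URL
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}}
```

Either part can then be appended to the `content` list of a user message, alongside a `{"type": "text", ...}` part carrying the question.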

The model excels at general questions about objects within images but struggles with detailed spatial queries, like pinpointing an object's location within a scene. Users should be aware of these limitations when considering applications for visual understanding.

Additionally, the Chat Completions API can handle multiple images and will integrate information from all provided images to address the posed question.

The code below illustrates these capabilities. You can run the full project from my GitHub workspace.

import base64
from openai import OpenAI

client = OpenAI()

# Function to encode the image
def encode_image(image_path):
  with open(image_path, "rb") as image_file:
    return base64.b64encode(image_file.read()).decode('utf-8')
  
# Path to your image
image_path = "./media/image001.jpeg"

# Getting the base64 string
base64_image = encode_image(image_path)


response = client.chat.completions.create(
  model="gpt-4-vision-preview",
  messages=[
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What's in these two images?"},
        {
          "type": "image_url",
          "image_url": {
            "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
          },
        },
        {
          "type": "image_url",
          "image_url": {
            "url": f"data:image/jpeg;base64,{base64_image}"
          }
        },
      ],
    }
  ],
  max_tokens=300,
)

print(response.choices[0])

Result:

For this picture, here is the description produced by the ChatGPT Vision API:

The second image is a digital artwork or a manipulated photo that presents a whimsical and surreal scene. It depicts an elephant being carried by a hot air balloon over a lush green hill, with a single tree standing on top of the hill and a flock of birds flying in the background. The sky is dramatic with storm clouds gathering, suggesting a contrast between the light-hearted, fantastical element of the floating elephant and the impending storm. This image does not represent a real-life scenario; it is created for artistic or conceptual purposes.