Naveen Krishnan


Introduction

One common misconception around AI-based document analysis is that complex PDF files, especially those containing images, can only be processed efficiently by specific multimodal models such as Gemini or Claude. While those models are powerful, they are not the only choice: GPT-4o, OpenAI's model available through Azure OpenAI, can also perform robust document analysis and summarization directly on PDF content.

In this blog, we’ll explore a solution that enables direct PDF content analysis, bypassing the need for supporting Azure services like Storage Accounts, Embeddings, or Azure AI Search. Instead, we leverage Python to process the PDF file into a format GPT-4o can understand and analyze, making it accessible for content and image insights.

This step-by-step guide is an eye-opener for those who still believe that complex PDF analysis isn't possible with the GPT-4o or GPT-4o mini language models.

The Challenge

Typically, processing PDFs for AI analysis involves steps like:

  • Preprocessing text and image content into embeddings for context and search,
  • Storing and retrieving content using Azure AI Search,
  • Building pipelines to transform PDF images into readable text using OCR.

While effective, these steps add complexity and cost, particularly for simpler workflows that just need to pull insights directly from a PDF, and especially for real-time processing.

To simplify this, I created a Python solution that:

  • Extracts text and images from a PDF,
  • Delivers the content directly to GPT-4o without additional storage or search layers,
  • Lets GPT-4o extract meaningful insights and recommendations, positioning it as a lightweight alternative to other LLMs.

Why GPT-4o?

The key advantage of GPT-4o lies in its large context window and adaptability, which let it handle varied content, including text and image descriptions, without switching to a different multimodal model. By structuring the PDF content strategically, GPT-4o can provide comprehensive insights across:

  • Summarization of long documents
  • Contextual analysis
  • Insight extraction from image captions or metadata

This flexibility transforms GPT-4o into an affordable yet potent option for document processing.
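
As a quick illustration, the same extracted text can be reused with different user prompts depending on the task. The prompt wording below is purely illustrative, not prescriptive:

# Hypothetical task prompts; adjust the wording to your use case
task_prompts = {
    "summarize": "Summarize the following document in five bullet points:",
    "context": "Describe the purpose, audience, and key arguments of this document:",
    "image_insights": "List insights from the image captions and metadata included below:",
}

# Pair one of the prompts with the text extracted from the PDF
user_content = f"{task_prompts['summarize']}\n\n{text}"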

The Solution: Python + GPT-4o

Let’s walk through the code that makes this possible. In this example, I use Python and GPT-4o to read and summarize a PDF, all without needing additional Azure services.

Prerequisites:

To replicate this solution, you’ll need:

  1. A Python environment with the PyMuPDF package (to extract text and images from the PDF) and the Requests package (to call the REST endpoint),
  2. Access to GPT-4o via Azure OpenAI.

Install required libraries:

pip install pymupdf requests

Code Walkthrough

Here’s a Python script that reads a PDF, extracts text and image content, and sends the processed data to GPT-4o.

import os
import base64
import requests
import fitz  # PyMuPDF

# Configuration
API_KEY = "<<Your-GPT-4O Key>>"
pdf_file_path = r"<<your-pdf-file-path>>"  # Raw string avoids backslash-escape issues on Windows
image_output_dir = r"<<your-image-output-directory>>"  # Folder where extracted images are saved
ENDPOINT = "https://<<your-azure-openai-resource-name>>.openai.azure.com/openai/deployments/gpt-4o/chat/completions?api-version=2024-02-15-preview"

text = ""
encoded_images = []
with fitz.open(pdf_file_path) as pdf:
    for page_number, page in enumerate(pdf, start=1):
        text += page.get_text()
        # Extract every image embedded on this page
        for img_index, img in enumerate(page.get_images(full=True), start=1):
            xref = img[0]  # cross-reference number of the image object
            base_image = pdf.extract_image(xref)
            image_data = base_image["image"]
            # Save the image to disk; the page number keeps filenames unique across pages
            image_filename = os.path.join(
                image_output_dir, f"img_{page_number}_{img_index}.{base_image['ext']}"
            )
            with open(image_filename, "wb") as image_file:
                image_file.write(image_data)
            # Base64-encode the raw bytes for inclusion in the GPT-4o request
            encoded_images.append(base64.b64encode(image_data).decode("ascii"))

# Prepare the chat payload for GPT-4o; the base64 strings are passed as plain text
messages = [
    {
        "role": "system",
        "content": "You are an AI assistant that helps people analyze the text."
    },
    {
        "role": "user",
        "content": f"Analyze the following pdf text: {text} and Image: {encoded_images}"
    }
]
payload = {
    "messages": messages,
    "temperature": 0.7,
    "top_p": 0.95,
    "max_tokens": 800
}
headers = {
    "Content-Type": "application/json",
    "api-key": API_KEY,
}

# Send the request to the Azure OpenAI endpoint
try:
    response = requests.post(ENDPOINT, headers=headers, json=payload)
    response.raise_for_status()  # Raises HTTPError for non-2xx status codes
    response_data = response.json()
    print(response_data)
except requests.RequestException as e:
    raise SystemExit(f"Failed to make the request. Error: {e}")
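
To pull just the model's reply out of the JSON response, index into the standard chat-completions shape:

# The assistant's reply lives under choices[0].message.content
summary = response_data["choices"][0]["message"]["content"]
print(summary)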

How It Works

  1. Extracting Text and Images: I used the PyMuPDF Python library to extract both text and images from the PDF. It was quick, taking little processing time for a 10-page PDF file.
  2. GPT-4o Analysis: The extracted text and images are then fed to GPT-4o, which processes them in a single request. The major caveat is keeping the conversation within a manageable context window; see the chunking sketch after this list.
  3. Output: GPT-4o summarizes the extracted content and returns the result in its standard JSON response format.
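
If a document exceeds the context window, one simple pattern is to split the text into fixed-size chunks, summarize each chunk, and then summarize the summaries. Here is a minimal sketch, assuming a hypothetical ask_gpt4o(prompt) helper that wraps the request code shown above:

def chunk_text(text, chunk_size=12000):
    # Split the document into roughly equal character-based chunks
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# Summarize each chunk separately (ask_gpt4o is a hypothetical helper)
partial_summaries = [
    ask_gpt4o(f"Summarize this section of a document:\n\n{chunk}")
    for chunk in chunk_text(text)
]

# Then combine the per-chunk summaries into one final summary
final_summary = ask_gpt4o(
    "Combine these partial summaries into one coherent summary:\n\n"
    + "\n\n".join(partial_summaries)
)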

Limitations & Considerations

This method is quite efficient and allows for real-time analysis of PDFs, but there are a few important points to keep in mind:

  • Image Analysis Limitations: GPT-4o can work with descriptions and metadata extracted from images, but this setup does not perform true image recognition. If your PDF contains images whose critical information is purely visual, GPT-4o will only process whatever descriptive data is extracted.
  • Context Window: GPT-4o has a limited context window, so large PDF files may need to be split into smaller sections. Managing those chunks can become complex; in such cases, you can use AI agents (for example, a group of conversable agents from the AutoGen framework) or any other pattern that fits your use case.
  • Potential Enhancements: You can add OCR to extract text from images, as sketched after this list. LangChain can also help with chunking and structuring PDF content for more efficient processing.
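
As one possible enhancement, here is a minimal OCR sketch. It assumes the pytesseract and Pillow packages (plus the Tesseract engine itself) are installed; any OCR library would work the same way. It continues from the extraction script above, reusing the encoded_images list:

import io
import base64
from PIL import Image
import pytesseract  # Requires the Tesseract OCR engine on the system

# OCR each extracted image and collect the recognized text
ocr_text = ""
for encoded in encoded_images:
    image = Image.open(io.BytesIO(base64.b64decode(encoded)))
    ocr_text += pytesseract.image_to_string(image) + "\n"

# Append the recognized text so GPT-4o sees the images' textual content
text += "\n\nText recognized inside images:\n" + ocr_text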

Conclusion

We saw GPT-4o's ability to directly analyze PDF files, images included, without needing embeddings, Azure AI Search, or any other additional Azure services. This shows the potential of GPT-4o for content-processing work with minimal resources. For those looking for a solid way to achieve this with OpenAI, and who don't want to adopt Gemini, Claude, or other multimodal LLMs just for document analysis, this solution works similarly well. The only requirement is a little bit of Python knowledge.

This is more than a convenience; it’s a game-changer for those needing leaner, cost-effective solutions without compromising on capability. So, the next time you’re looking to dive into a PDF for insights, consider GPT-4o.

Hope this inspires others to explore what’s possible with GPT-4o. Happy coding!