Naveen Krishnan

Towards AI

Photo from oneusefulthing.org

The accompanying code for this tutorial is available at https://github.com/navintkr/audio-rag-structured-unstructured-data

Introduction

In the ever-evolving landscape of artificial intelligence, the introduction of the Azure OpenAI GPT-4o Realtime API marks a significant milestone. As an AI enthusiast and developer, I was thrilled to explore this cutting-edge technology and its potential applications. This blog delves into the intricacies of the GPT-4o Realtime API, exploring its features, capabilities, and practical uses. Whether you’re a seasoned developer or an AI enthusiast, this comprehensive guide will provide you with a detailed understanding of how to leverage the GPT-4o Realtime API for creating immersive, real-time speech-to-speech experiences.

What’s GPT-4o Realtime? 🤔

GPT-4o Realtime is designed to enable developers to build low-latency, multimodal experiences in their applications. Imagine having natural, seamless conversations with AI-powered voice assistants that understand and respond in real time. Traditional pipelines chained separate models for speech recognition, text processing, and speech synthesis. The GPT-4o Realtime API handles the whole exchange through a single API, significantly reducing latency and improving the naturalness of interactions.

Demo 😱

Video by Author

Key Features and Capabilities

Low-Latency Speech-to-Speech Interactions

The GPT-4o Realtime API supports fast, real-time speech-to-speech interactions. This is achieved through a persistent WebSocket connection that allows for asynchronous streaming communication between the user and the model. This setup ensures that responses are generated quickly, maintaining the flow of natural conversation.

Multimodal Support

The API is capable of handling various input and output modalities, including text, audio, and function calls. This flexibility allows developers to create rich, interactive experiences that can respond to user inputs in multiple formats.

Function Calling

One of the standout features of the GPT-4o Realtime API is its support for function calling. This enables voice assistants to perform actions or retrieve context-specific information based on user requests. For example, a voice assistant could place an order or fetch customer details to personalize responses.
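To make the function-calling idea concrete, here is a minimal sketch of how a tool might be described to the model and then dispatched locally when the model asks for it. The tool name get_customer_details, the in-memory customer table, and the dispatch_tool_call helper are all hypothetical and are not taken from the tutorial's repo; the tool definition follows the JSON-schema style used for function tools, so check the current API reference before relying on the exact field names.

```python
import json

# Hypothetical tool definition for illustration only (not from the tutorial's repo).
GET_CUSTOMER_TOOL = {
    "type": "function",
    "name": "get_customer_details",
    "description": "Look up a customer's profile by ID.",
    "parameters": {
        "type": "object",
        "properties": {"customer_id": {"type": "string"}},
        "required": ["customer_id"],
    },
}

# Stand-in data source; a real app would query a database or service.
_CUSTOMERS = {"c-001": {"name": "Ada", "tier": "gold"}}


def get_customer_details(customer_id: str) -> dict:
    """Return the customer record, or an error payload if it is unknown."""
    return _CUSTOMERS.get(customer_id, {"error": "customer not found"})


# Maps tool names (as the model would send them) to local Python callables.
TOOL_HANDLERS = {"get_customer_details": get_customer_details}


def dispatch_tool_call(name: str, arguments_json: str) -> str:
    """Run the named tool with JSON-encoded arguments and return a JSON string."""
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        args = json.loads(arguments_json)
        return json.dumps(handler(**args))
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON or arguments that do not match the handler signature.
        return json.dumps({"error": str(exc)})
```

The JSON string returned by dispatch_tool_call is what the application would send back to the model as the function output, so the assistant can phrase a spoken answer around it.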

Voice Activity Detection (VAD)

The API includes advanced voice activity detection capabilities, which automatically handle interruptions and manage the flow of conversation. This ensures that the system can respond appropriately to user inputs without unnecessary delays.

Integration with Existing Tools

The GPT-4o Realtime API is designed to work seamlessly with existing tools and services. For instance, it can be integrated with Twilio’s Voice APIs to build and deploy AI virtual agents that interact with customers via voice calls.

Practical Applications and Use Cases

The GPT-4o Realtime API opens up a plethora of possibilities for developers across various domains. Here are some practical applications:

Customer Support

By integrating the GPT-4o Realtime API, businesses can enhance their customer support systems with AI-powered voice assistants that provide quick and accurate responses to customer queries. This can significantly improve customer satisfaction and reduce the workload on human agents.

Language Learning

Language learning apps can leverage the API to create interactive role-play scenarios where users can practice conversations in a new language. The real-time feedback and natural interaction can make the learning process more engaging and effective.

Healthcare

In the healthcare sector, the API can be used to develop virtual health assistants that provide patients with timely information and support. For example, a nutrition and fitness coaching app could use the API to enable natural conversations with an AI coach, offering personalized advice and motivation.

Accessibility

The GPT-4o Realtime API can also be a game-changer for accessibility. It can be used to develop tools that assist individuals with disabilities, such as voice-activated interfaces for controlling smart home devices or real-time transcription services for people who are deaf or hard of hearing.

Personalized Shopping Assistants

E-commerce platforms can deploy virtual shopping assistants that help customers find products, answer questions, and provide personalized recommendations based on their preferences and past purchases.

Interactive Voice Response (IVR) Systems

Traditional IVR systems can be enhanced with GPT-4o Realtime to provide more natural and intuitive interactions. Instead of navigating through a series of menu options, customers can simply speak their requests, and the system can understand and respond appropriately.

Getting Started with GPT-4o Realtime API

To begin using the GPT-4o Realtime API, developers need to create an Azure OpenAI resource in a supported region (e.g., eastus2 or swedencentral) and deploy the gpt-4o-realtime-preview model. The API requires a secure WebSocket connection, which can be established using the following URI format:

wss://<your-resource-name>.openai.azure.com/openai/realtime?api-version=2024-10-01-preview&deployment=gpt-4o-realtime-preview
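As a small illustration, the URI above can be assembled from its parts in Python. The function name build_realtime_url and the example resource name are my own placeholders; the path and query parameters are the ones shown in the URI format above.

```python
from urllib.parse import urlencode


def build_realtime_url(
    resource_name: str,
    deployment: str,
    api_version: str = "2024-10-01-preview",
) -> str:
    """Build the secure WebSocket URI for an Azure OpenAI realtime deployment."""
    # urlencode keeps the insertion order of the dict and escapes unsafe characters.
    query = urlencode({"api-version": api_version, "deployment": deployment})
    return f"wss://{resource_name}.openai.azure.com/openai/realtime?{query}"


if __name__ == "__main__":
    # "my-resource" is a placeholder; use your own Azure OpenAI resource name.
    print(build_realtime_url("my-resource", "gpt-4o-realtime-preview"))
```

Keeping the API version as a parameter makes it easier to move off the preview version later without touching the rest of the connection code.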

Authentication can be handled using either a Bearer token (for managed identity) or an API key. Once authenticated, developers can configure the session to customize input and output behaviors, such as audio format, transcription models, and turn detection settings.
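The sketch below shows one way the two authentication options and a session configuration message could be prepared before opening the WebSocket. It only builds plain dictionaries and does not open a connection. The helper names are hypothetical, and the session field names (voice, input_audio_format, input_audio_transcription, turn_detection and so on) reflect my understanding of the preview API's session.update event, so verify them against the current reference before use.

```python
import json
from typing import Optional


def build_auth_headers(
    api_key: Optional[str] = None, bearer_token: Optional[str] = None
) -> dict:
    """Return WebSocket handshake headers for API-key or Entra ID (Bearer) auth."""
    if api_key:
        return {"api-key": api_key}
    if bearer_token:
        return {"Authorization": f"Bearer {bearer_token}"}
    raise ValueError("Provide either an API key or a bearer token.")


def build_session_update(voice: str = "alloy", silence_ms: int = 500) -> dict:
    """Build a session.update event (field names assumed from the preview API)."""
    return {
        "type": "session.update",
        "session": {
            "voice": voice,
            "input_audio_format": "pcm16",
            "output_audio_format": "pcm16",
            # Ask the service to transcribe the user's audio as well.
            "input_audio_transcription": {"model": "whisper-1"},
            # Server-side voice activity detection decides when a turn ends.
            "turn_detection": {
                "type": "server_vad",
                "threshold": 0.5,
                "silence_duration_ms": silence_ms,
            },
        },
    }


if __name__ == "__main__":
    print(build_auth_headers(api_key="<your api key>"))
    # The event would be sent as a JSON text frame once the socket is open.
    print(json.dumps(build_session_update(), indent=2))
```

Preferring the API key when both are supplied is an arbitrary choice in this sketch; an app that favors managed identity would check the token first.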

In this sample implementation, we cover both structured and unstructured data along with the capabilities of GPT-4o Realtime. Let’s walk through the process of implementing a real-time voice assistant. This example uses Azure OpenAI’s gpt-4o-realtime-preview model for the voice interaction and GPT-4o for text-to-SQL.

Here is the high-level design:

Image by Author

We’ll follow four steps to get this example running in your own environment: pre-requisites, creating an index for unstructured data (plus setting up SQL with sample data), setting up the environment, and running the app.

1. Pre-requisites

You’ll need instances of the following Azure services. You can reuse existing service instances or create new ones.

  1. Azure OpenAI, with three model deployments: one of the gpt-4o-realtime-preview models, one for embeddings (e.g. text-embedding-3-large, text-embedding-3-small, or text-embedding-ada-002), and one GPT-4o deployment

  2. Azure AI Search, any tier Basic or above will work, ideally with Semantic Search enabled

  3. Azure Blob Storage, with a container that has the content that represents your knowledge base (we include some sample data in this repo if you want an easy starting point)

  4. Azure SQL; refer to “data/structured/SQL DDL and Sample Data.txt” in this repo for the DDL and SQL insert statements.

2. Creating an index for Unstructured Data

RAG applications use a retrieval system to get the right grounding data for LLMs. We use Azure AI Search as our retrieval system, so we need to get our knowledge base (e.g. documents or any other content you want the app to be able to talk about) into an Azure AI Search index.

If you already have an Azure AI Search index

You can use an existing index directly. If you created that index using the “Import and vectorize data” option in the portal, no further changes are needed. Otherwise, you’ll need to update the field names in the code to match your text/vector fields.
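To illustrate why the field names matter, here is a hedged sketch of a hybrid (keyword plus vector) query body for the Azure AI Search REST API, with the text and vector field names passed in as parameters. The defaults chunk and text_vector are what I believe the “Import and vectorize data” wizard generates; treat them, the function name, and the use of a text-kind vector query (which relies on the index having a vectorizer configured) as assumptions to check against your own index.

```python
import json


def build_hybrid_search_body(
    query: str,
    text_field: str = "chunk",          # assumed wizard default; change to match your index
    vector_field: str = "text_vector",  # assumed wizard default; change to match your index
    top: int = 5,
) -> dict:
    """Build a request body that combines keyword search with a vector query."""
    return {
        "search": query,        # keyword part of the hybrid query
        "select": text_field,   # only return the text needed for grounding
        "top": top,
        "vectorQueries": [
            {
                # "text" asks the service to embed the query using the index's vectorizer.
                "kind": "text",
                "text": query,
                "fields": vector_field,
                "k": top,
            }
        ],
    }


if __name__ == "__main__":
    # An index created by hand might use different names, e.g. "content" and "embedding".
    body = build_hybrid_search_body(
        "What is covered by the warranty?",
        text_field="content",
        vector_field="embedding",
    )
    print(json.dumps(body, indent=2))
```

Keeping the field names in one place like this means switching to a differently shaped index is a configuration change, not a code hunt.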

Creating a new index with sample data or your own

Follow these steps to create a new index. We’ll create a setup where, once created, you can add, delete, or update your documents in blob storage and the index will automatically follow the changes.

  1. Upload your documents to an Azure Blob Storage container. An easy way to do this is using the Azure Portal: navigate to the container and use the upload option to move your content (e.g. PDFs, Office docs, etc.)
  2. In the Azure Portal, go to your Azure AI Search service and select “Import and vectorize data”, choose Blob Storage, then point at your container and follow the rest of the steps on the screen.
  3. Once the indexing process completes, you’ll have a search index ready for vector and hybrid search.

For more details on ingesting data in Azure AI Search using “Import and vectorize data”, here’s a quickstart.

2.1 Prepare for Structured Data

You can use the SQL statements in the data/structured folder of the navintkr/audio-rag-structured-unstructured-data repository on GitHub to prepare the structured-data demo.

3. Setting up the environment

The app needs to know which service endpoints to use for Azure OpenAI, Azure AI Search, and Azure SQL. The following variables can be set as environment variables, or you can create a “.env” file in the “app/backend/” directory with this content.

AZURE_OPENAI_ENDPOINT=wss://<your instance name>.openai.azure.com
AZURE_OPENAI_DEPLOYMENT=gpt-4o-realtime-preview
AZURE_OPENAI_API_KEY=<your api key>
AZURE_SEARCH_ENDPOINT=https://<your service name>.search.windows.net
AZURE_SEARCH_INDEX=<your index name>
AZURE_SEARCH_API_KEY=<your api key>
OPENAI_CHAT_MODEL=gpt-4o
AZURE_SQL_SERVER=<your-SQL-Server-Name>
AZURE_SQL_DB=<your-SQL-DB-Name>
AZURE_SQL_USERNAME=<your-SQL-Server-Username>
AZURE_SQL_PWD=<your-SQL-Server-Password>

To use Entra ID (your user when running locally, managed identity when deployed), simply don’t set the keys.
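As an illustration of how a backend might consume these settings, the sketch below reads them from a mapping (the process environment by default) and treats the API keys as optional, falling back to Entra ID when they are absent. The AppConfig class and load_config function are hypothetical and are not the repo's actual code. The sketch also assumes the variables are already in the environment (for example, loaded from the “.env” file by a tool such as python-dotenv) and leaves out the SQL settings for brevity.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Settings the app cannot run without; the API keys are deliberately not listed here.
REQUIRED = (
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_DEPLOYMENT",
    "AZURE_SEARCH_ENDPOINT",
    "AZURE_SEARCH_INDEX",
)


@dataclass
class AppConfig:
    openai_endpoint: str
    openai_deployment: str
    search_endpoint: str
    search_index: str
    openai_api_key: Optional[str] = None
    search_api_key: Optional[str] = None

    @property
    def uses_entra_id(self) -> bool:
        """True when no keys are set, meaning Entra ID credentials should be used."""
        return self.openai_api_key is None and self.search_api_key is None


def load_config(env: Mapping[str, str] = os.environ) -> AppConfig:
    """Validate required settings and build the config from the given mapping."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise KeyError("Missing required settings: " + ", ".join(missing))
    return AppConfig(
        openai_endpoint=env["AZURE_OPENAI_ENDPOINT"],
        openai_deployment=env["AZURE_OPENAI_DEPLOYMENT"],
        search_endpoint=env["AZURE_SEARCH_ENDPOINT"],
        search_index=env["AZURE_SEARCH_INDEX"],
        # Empty strings are treated the same as unset keys.
        openai_api_key=env.get("AZURE_OPENAI_API_KEY") or None,
        search_api_key=env.get("AZURE_SEARCH_API_KEY") or None,
    )
```

Passing the mapping in as a parameter keeps the function easy to test: a plain dictionary can stand in for the real environment.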

4. Running the app

  1. Install the required tools: Node.js, Python and PowerShell.

  2. Clone the repo (git clone https://github.com/navintkr/audio-rag-structured-unstructured-data).

  3. Create a Python virtual environment and activate it.

  4. Set the environment variables, or create the “.env” file in the “app/backend/” directory, as described in the “Setting up the environment” section above.

  5. Run this command to start the app:

Windows:

cd app 
pwsh .\start.ps1

Linux/Mac:

cd app
./start.sh

The app should be available at http://localhost:8765.

Once the app is running, navigate to the URL above to see the start screen of the app.

You can use the dropdown to switch between structured and unstructured data.

Frontend: enabling direct communication with AOAI Realtime API

You can make the frontend skip the middle tier and talk to the AOAI Realtime API over WebSockets directly if you choose to do so. However, note that this will stop RAG from happening and will require exposing your API key in the frontend, which is very insecure. DO NOT use this in production.

Just pass some extra parameters to the useRealTime hook:

const { startSession, addUserAudio, inputAudioBufferClear } = useRealTime({
        useDirectAoaiApi: true,
        aoaiEndpointOverride: "wss://<NAME>.openai.azure.com",
        aoaiApiKeyOverride: "<YOUR API KEY, INSECURE!!!>",
        aoaiModelOverride: "gpt-4o-realtime-preview",
        // ...
});

Conclusion

The Azure OpenAI GPT-4o Realtime API represents a significant advancement in the field of AI, offering developers the tools to create highly responsive and natural speech-to-speech interactions. By leveraging this technology, businesses and developers can build innovative applications that enhance user experiences across various domains, from customer support and language learning to healthcare and accessibility.

For those interested in exploring the full potential of the GPT-4o Realtime API, I encourage you to dive into the detailed documentation and sample code available on GitHub. With the right approach and creativity, the possibilities are endless.

Happy coding!