Voice RAG with GPT-4O Realtime
All about GPT-4O realtime api and a Step-by-Step Guide to implementing Voice RAG with Practical Python example
Created Oct 14, 2024 - Last updated: Oct 14, 2024
The accompanying code for this tutorial is: here
Introduction
In the ever-evolving landscape of artificial intelligence, the introduction of the Azure OpenAI GPT-4o Realtime API marks a significant milestone. As an AI enthusiast and developer, I was thrilled to explore this cutting-edge technology and its potential applications. This blog delves into the intricacies of the GPT-4o Realtime API, exploring its features, capabilities, and practical uses. Whether youâre a seasoned developer or an AI enthusiast, this comprehensive guide will provide you with a detailed understanding of how to leverage the GPT-4o Realtime API for creating immersive, real-time speech-to-speech experiences.
Whatâs GPT-4o Realtime? đ€
The GPT-4o Realtime is designed to enable developers to build low-latency, multimodal experiences in their applications. Imagine having natural, seamless conversations with AI-powered voice assistants that understand and respond in real-time. Unlike traditional methods that required multiple models to handle speech recognition, text processing, and speech synthesis, the GPT-4o Realtime API streamlines the process into a single API call, significantly reducing latency and improving the naturalness of interactions.
Demođ±
[other]Video by Author[/other]
Key Features and Capabilities
Low-Latency Speech-to-Speech Interactions
The GPT-4o Realtime API supports fast, real-time speech-to-speech interactions. This is achieved through a persistent WebSocket connection that allows for asynchronous streaming communication between the user and the model. This setup ensures that responses are generated quickly, maintaining the flow of natural conversation.
Multimodal Support
The API is capable of handling various input and output modalities, including text, audio, and function calls. This flexibility allows developers to create rich, interactive experiences that can respond to user inputs in multiple formats.
Function Calling
One of the standout features of the GPT-4o Realtime API is its support for function calling. This enables voice assistants to perform actions or retrieve context-specific information based on user requests. For example, a voice assistant could place an order or fetch customer details to personalize responses.
Voice Activity Detection (VAD)
The API includes advanced voice activity detection capabilities, which automatically handle interruptions and manage the flow of conversation. This ensures that the system can respond appropriately to user inputs without unnecessary delays.
Integration with Existing Tools
The GPT-4o Realtime API is designed to work seamlessly with existing tools and services. For instance, it can be integrated with Twilioâs Voice APIs to build and deploy AI virtual agents that interact with customers via voice calls.
Practical Applications and Use Cases
The GPT-4o Realtime API opens up a plethora of possibilities for developers across various domains. Here are some practical applications:
Customer Support
By integrating the GPT-4o Realtime API, businesses can enhance their customer support systems with AI-powered voice assistants that provide quick and accurate responses to customer queries. This can significantly improve customer satisfaction and reduce the workload on human agents.
Language Learning
Language learning apps can leverage the API to create interactive role-play scenarios where users can practice conversations in a new language. The real-time feedback and natural interaction can make the learning process more engaging and effective.
Healthcare
In the healthcare sector, the API can be used to develop virtual health assistants that provide patients with timely information and support. For example, a nutrition and fitness coaching app could use the API to enable natural conversations with an AI coach, offering personalized advice and motivation.
Accessibility
The GPT-4o Realtime API can also be a game-changer for accessibility. It can be used to develop tools that assist individuals with disabilities, such as voice-activated interfaces for controlling smart home devices or real-time transcription services for the hearing impaired.
Personalized Shopping Assistants
E-commerce platforms can deploy virtual shopping assistants that help customers find products, answer questions, and provide personalized recommendations based on their preferences and past purchases.
Interactive Voice Response (IVR) Systems
Traditional IVR systems can be enhanced with GPT-4o Realtime to provide more natural and intuitive interactions. Instead of navigating through a series of menu options, customers can simply speak their requests, and the system can understand and respond appropriately.
Getting Started with GPT-4o Realtime API
To begin using the GPT-4o Realtime API, developers need to create an Azure OpenAI resource in a supported region (e.g., eastus2 or swedencentral) and deploy the gpt-4o-realtime-preview model. The API requires a secure WebSocket connection, which can be established using the following URI format:
wss://<your-resource-name>.openai.azure.com/openai/realtime?api-version=2024-10-01-preview&deployment=gpt-4o-realtime-preview
Authentication can be handled using either a Bearer token (for managed identity) or an API key. Once authenticated, developers can configure the session to customize input and output behaviors, such as audio format, transcription models, and turn detection settings.
In this sample implementation, we are covering both structured and unstructured data along with the capabilities of the GPT-4o Realtime. Letâs walk through the process of implementing a real-time voice assistant. This example will use the Azure OpenAIâs gpt-4o-realtime-preview model and GPT-4O for Text-SQL.
Here is the high-level design:
Weâll follow 4 steps to get this example running in your own environment: pre-requisites, creating an index (for Unstructured Data), Setting up SQL with data, setting up the environment, and running the app.
- Pre-requisites ==================
Youâll need instances of the following Azure services. You can re-use service instances you have already or create new ones.
-
Azure OpenAI, with 3 model deployments, one of the gpt-4o-realtime-preview models, one for embeddings (e.g.text-embedding-3-large, text-embedding-3-small, or text-embedding-ada-002) and one GPT 4O
-
Azure AI Search, any tier Basic or above will work, ideally with Semantic Search enabled
-
Azure Blob Storage, with a container that has the content that represents your knowledge base (we include some sample data in this repo if you want an easy starting point)
-
Azure SQL, refer data/structured/SQL DDL and Sample Data.txt in this repo for DDL and SQL Insert statements.
-
Creating an index for Unstructured Data ===========================================
RAG applications use a retrieval system to get the right grounding data for LLMs. We use Azure AI Search as our retrieval system, so we need to get our knowledge base (e.g. documents or any other content you want the app to be able to talk about) into an Azure AI Search index.
If you already have an Azure AI Search index
You can use an existing index directly. If you created that index using the âImport and vectorize dataâ option in the portal, no further changes are needed. Otherwise, youâll need to update the field names in the code to match your text/vector fields.
Creating a new index with sample data or your own
Follow these steps to create a new index. Weâll create a setup where once created, you can add, delete, or update your documents in blob storage and the index will automatically follow the changes.
- Upload your documents to an Azure Blob Storage container. An easy way to do this is using the Azure Portal: navigate to the container and use the upload option to move your content (e.g. PDFs, Office docs, etc.)
- In the Azure Portal, go to your Azure AI Search service and select âImport and vectorize dataâ, choose Blob Storage, then point at your container and follow the rest of the steps on the screen.
- Once the indexing process completes, youâll have a search index ready for vector and hybrid search.
For more details on ingesting data in Azure AI Search using âImport and vectorize dataâ, hereâs a quickstart.
2.1 Prepare for Structured Data
You can use the SQL statements here audio-rag-structured-unstructured-data/data/structured at main · navintkr/audio-rag-structured-unstructured-data (github.com) and prepare for structured demo.
- Setting up the environment ==============================
The app needs to know which service endpoints to use for the Azure OpenAI and Azure AI Search. The following variables can be set as environment variables, or you can create a â.envâ file in the âapp/backend/â directory with this content.
AZURE_OPENAI_ENDPOINT=wss://<your instance name>.openai.azure.com
AZURE_OPENAI_DEPLOYMENT=gpt-4o-realtime-preview
AZURE_OPENAI_API_KEY=<your api key>
AZURE_SEARCH_ENDPOINT=https://<your service name>.search.windows.net
AZURE_SEARCH_INDEX=<your index name>
AZURE_SEARCH_API_KEY=<your api key>
OPENAI_CHAT_MODEL=gpt-4o
AZURE_SQL_SERVER=<your-SQL-Server-Name>
AZURE_SQL_DB=<your-SQL-DB-Name>
AZURE_SQL_USERNAME=<your-SQL-Server-Username>
AZURE_SQL_PWD=<your-SQL-Server-Password>
To use Entra ID (your user when running locally, managed identity when deployed) simply donât set the keys.
-
Running the app ===================
-
Install the required tools: Node.js, Python and Powershell.
-
Clone the repo (
git clone [https://github.com/navintkr/audio-rag-structured-unstructured-data](https://github.com/navintkr/audio-rag-structured-unstructured-data))
)) -
Create a Python virtual environment and activate it.
-
The app needs to know which service endpoints to use for the Azure OpenAI and Azure AI Search. The following variables can be set as environment variables, or you can create a â.envâ file in the âapp/backend/â directory with this content.
AZURE_OPENAI_ENDPOINT=wss://<your instance name>.openai.azure.com
AZURE_OPENAI_DEPLOYMENT=gpt-4o-realtime-preview
AZURE_OPENAI_API_KEY=<your api key>
AZURE_SEARCH_ENDPOINT=https://<your service name>.search.windows.net
AZURE_SEARCH_INDEX=<your index name>
AZURE_SEARCH_API_KEY=<your api key>
OPENAI_CHAT_MODEL=gpt-4o
AZURE_SQL_SERVER=<your-SQL-Server-Name>
AZURE_SQL_DB=<your-SQL-DB-Name>
AZURE_SQL_USERNAME=<your-SQL-Server-Username>
AZURE_SQL_PWD=<your-SQL-Server-Password>
To use Entra ID (your user when running locally, managed identity when deployed) simply donât set the keys.
- Run this command to start the app:
Windows:
cd app
pwsh .\start.ps1
Linux/Mac:
cd app ./start.sh
The app should be available on http://localhost:8765
Once the app is running, when you navigate to the URL above you should see the start screen of the app:
You can use the dropdown to tweak between structured and unstructured data
Frontend: enabling direct communication with AOAI Realtime API
You can make the frontend skip the middle tier and talk to the WebSockets AOAI Realtime API directly if you choose to do so. However, note thisâll stop RAG from happening and will require exposing your API key in the frontend, which is very insecure. DO NOT use this in production.
Just Pass some extra parameters to the useRealtime
hook:
const { startSession, addUserAudio, inputAudioBufferClear } = useRealTime({
useDirectAoaiApi: true,
aoaiEndpointOverride: "wss://<NAME>.openai.azure.com",
aoaiApiKeyOverride: "<YOUR API KEY, INSECURE!!!>",
aoaiModelOverride: "gpt-4o-realtime-preview",
...
);
Conclusion
The Azure OpenAI GPT-4o Realtime API represents a significant advancement in the field of AI, offering developers the tools to create highly responsive and natural speech-to-speech interactions. By leveraging this technology, businesses and developers can build innovative applications that enhance user experiences across various domains, from customer support and language learning to healthcare and accessibility.
For those interested in exploring the full potential of the GPT-4o Realtime API, I encourage you to dive into the detailed documentation and sample code available on GitHub. With the right approach and creativity, the possibilities are endless.
Happy coding!