How to Build an AI Agent with Voice and Vision Using Gemini

Jake McCluskey

You can build a multimodal AI agent that combines real-time voice conversation with webcam vision using Google's Gemini Live API, which is available on the free tier. The agent opens a WebSocket connection and streams microphone audio and webcam frames over it at the same time, while handling bidirectional messages so that you can interrupt the model. The architecture takes roughly 150 lines of Python that coordinate three concurrent streams: audio capture at 16kHz, video frame extraction at 1-2 FPS, and playback of the model's audio responses with interruption detection. Because the WebSocket stays open, the model processes multimodal input continuously, and its responses can be cut off mid-sentence when you start speaking again, which gives the conversation a natural flow that mirrors human interaction.

What Makes Gemini Live API Different from Standard Chatbot APIs

Gemini Live API operates on a fundamentally different architecture than traditional request-response chatbot APIs. Instead of sending complete prompts and waiting for full responses, it maintains a persistent WebSocket connection that handles streaming data in both directions simultaneously.

The API processes audio at 16kHz sample rate using PCM16 encoding, which translates to roughly 32KB of data per second of speech. For video input, you send base64-encoded JPEG frames at intervals you control, typically 1-2 frames per second to balance responsiveness with API quota usage. The free tier supports up to 1,500 requests per day with rate limits of 15 requests per minute, which is sufficient for building and testing conversational agents that run for several hours daily.
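The 32KB figure follows directly from the format parameters. Here is a quick sketch of the arithmetic; the helper names are ours rather than part of any API, and the 1024-sample chunk size is the one used in the capture code later in this post:

```python
def pcm_bytes_per_second(sample_rate: int = 16000,
                         bits_per_sample: int = 16,
                         channels: int = 1) -> int:
    """Raw (uncompressed) PCM data rate in bytes per second."""
    return sample_rate * (bits_per_sample // 8) * channels


def chunk_duration_ms(chunk_samples: int = 1024,
                      sample_rate: int = 16000) -> float:
    """How much audio one chunk of samples represents, in milliseconds."""
    return 1000.0 * chunk_samples / sample_rate


print(pcm_bytes_per_second())  # 32000 bytes, i.e. roughly 32KB per second
print(chunk_duration_ms())     # 64.0 ms of audio per 1024-sample chunk
```

Base64 encoding adds about a third on top of the raw figure, so expect the traffic on the wire to be somewhat higher.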

Traditional text-based APIs from 2023 required you to wait for complete responses before sending new input. Gemini Live's streaming model lets you interrupt the AI while it's speaking, immediately stopping its audio output and processing your new input. This architectural shift reduces perceived latency by approximately 60-70% compared to turn-based conversation systems.

Why Real-Time Multimodal Agents Matter for Practical Applications

Voice and vision combined create interaction patterns that text-only interfaces can't replicate. When you're working hands-free in a workshop, cooking in a kitchen, helping someone with mobility limitations, or assembling furniture, typing isn't practical or possible.

Multimodal agents can process approximately 30% more context than voice-only systems because they see what you're pointing at, reading, or building. A tutoring agent can watch a student work through a math problem on paper and provide guidance in real time. A technical support agent can see error messages on your screen while you describe the problem verbally. These use cases require the agent to correlate visual and auditory input streams, which the Gemini Live API handles natively.

The market for voice AI applications is projected to reach $26.8 billion by 2025, with conversational AI systems that include vision capabilities growing at roughly 35% annually. More importantly, user satisfaction scores for multimodal interactions average 4.2 out of 5 compared to 3.1 for text-only chatbots, according to recent enterprise deployment data.

How to Set Up Your Development Environment for Multimodal Streaming

You'll need Python 3.10 or newer with several specific libraries that handle audio capture, video processing, and WebSocket communication. Start by creating a new virtual environment to isolate dependencies.

Install the required packages using pip. You need the google-genai library for API access, pyaudio for microphone input, opencv-python for webcam capture, and websockets for maintaining the streaming connection:

pip install google-genai pyaudio opencv-python websockets pillow numpy

Get your Gemini API key from Google AI Studio at aistudio.google.com. The free tier provides access to Gemini 2.0 Flash models with multimodal capabilities. Store your API key in an environment variable rather than hardcoding it:

export GEMINI_API_KEY="your_api_key_here"
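It can help to fail fast when the variable is missing, since a missing key otherwise only shows up later as a rejected connection. A minimal sketch, assuming a hypothetical helper named `load_api_key` (not part of the google-genai library):

```python
import os


def load_api_key(env=None, name="GEMINI_API_KEY"):
    """Return the API key from the environment, or raise if it is unset."""
    env = os.environ if env is None else env
    key = (env.get(name) or "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before starting the agent")
    return key
```

Calling `load_api_key()` once at startup gives a clear error message instead of a WebSocket URL that silently contains `None`.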

Test your audio setup by verifying that PyAudio can access your microphone. On macOS, you may need to grant your terminal or IDE microphone permission in System Settings (System Preferences on older macOS versions). On Linux, ensure your user is in the audio group. Windows typically works without additional configuration.

Configuring Audio and Video Input Streams

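Audio streaming requires specific format parameters that match Gemini Live's expectations. The API accepts 16-bit PCM audio at 16kHz sample rate, mono channel. Your microphone likely captures natively at 44.1kHz or 48kHz. Requesting 16kHz from PyAudio usually works because the operating system's audio layer converts the rate for you; if opening the stream fails on your device, capture at the native rate and resample before sending.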

Here's the audio configuration that works reliably across platforms:

import pyaudio
import numpy as np

CHUNK_SIZE = 1024
SAMPLE_RATE = 16000
FORMAT = pyaudio.paInt16
CHANNELS = 1

audio = pyaudio.PyAudio()
stream = audio.open(
    format=FORMAT,
    channels=CHANNELS,
    rate=SAMPLE_RATE,
    input=True,
    frames_per_buffer=CHUNK_SIZE
)
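If your device refuses to open at 16kHz, one fallback is to capture at the native rate and convert in software. The sketch below is a deliberately simple linear-interpolation resampler using only the standard library; `resample_pcm16` is a hypothetical helper, it assumes mono 16-bit samples in the machine's native byte order, and it applies no anti-aliasing filter, so a dedicated audio library will give better quality:

```python
from array import array


def resample_pcm16(pcm: bytes, src_rate: int, dst_rate: int) -> bytes:
    """Resample mono 16-bit PCM by linear interpolation (no low-pass filter)."""
    src = array("h")
    src.frombytes(pcm)
    if src_rate == dst_rate or len(src) == 0:
        return pcm

    n_out = len(src) * dst_rate // src_rate
    step = src_rate / dst_rate          # source samples per output sample
    last = len(src) - 1
    out = array("h")
    for i in range(n_out):
        pos = i * step
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        # Weighted average of the two neighbouring source samples
        out.append(int(round(src[lo] * (1.0 - frac) + src[hi] * frac)))
    return out.tobytes()
```

For example, a chunk read from a 48kHz stream would be passed through `resample_pcm16(chunk, 48000, 16000)` before being sent.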

For video, you want to capture frames at a rate that provides context without overwhelming the API with redundant data. Capturing 1 frame per second works well for most applications where the visual context changes gradually:

import cv2
import base64
from PIL import Image
import io

cap = cv2.VideoCapture(0)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)

def capture_frame():
    ret, frame = cap.read()
    if not ret:
        return None
    
    # Convert BGR to RGB
    frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    
    # Compress to JPEG
    pil_image = Image.fromarray(frame_rgb)
    buffer = io.BytesIO()
    pil_image.save(buffer, format="JPEG", quality=70)
    
    # Encode to base64
    return base64.b64encode(buffer.getvalue()).decode('utf-8')
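To hold the capture rate at roughly one frame per second without blocking the audio path, a small time-based gate is enough. This is a sketch under our own naming (`FrameThrottle` is hypothetical); the explicit `now` argument exists only to make it easy to test:

```python
import time
from typing import Optional


class FrameThrottle:
    """Allows an action at most once per `interval_s` seconds."""

    def __init__(self, interval_s: float = 1.0):
        self.interval_s = interval_s
        self._next_due: Optional[float] = None

    def ready(self, now: Optional[float] = None) -> bool:
        """Return True when the interval has elapsed, and start a new one."""
        now = time.monotonic() if now is None else now
        if self._next_due is None or now >= self._next_due:
            self._next_due = now + self.interval_s
            return True
        return False
```

In a capture loop you would check `throttle.ready()` on each iteration and call `capture_frame()` only when it returns True.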

Building the WebSocket Connection for Bidirectional Streaming

The Gemini Live API uses WebSockets to maintain a persistent connection that handles multiple data types simultaneously. You'll send audio chunks and video frames to the API while receiving text transcriptions and audio responses back.

The connection process involves three phases: establishing the WebSocket, sending configuration parameters, and entering the main streaming loop. Here's the core connection setup:

import asyncio
import websockets
import json
import os

GEMINI_API_KEY = os.environ.get('GEMINI_API_KEY')
GEMINI_WS_URL = f"wss://generativelanguage.googleapis.com/ws/google.ai.generativelanguage.v1alpha.GenerativeService.BidiGenerateContent?key={GEMINI_API_KEY}"

async def connect_to_gemini():
    # Open the socket without `async with`: that block would close the
    # connection on exit, before the caller could use the returned object.
    # The caller is responsible for calling `await ws.close()` when done.
    ws = await websockets.connect(GEMINI_WS_URL)

    # Send initial configuration
    config = {
        "setup": {
            "model": "models/gemini-2.0-flash-exp",
            "generation_config": {
                "response_modalities": ["AUDIO"],
                "speech_config": {
                    "voice_config": {
                        "prebuilt_voice_config": {
                            "voice_name": "Aoede"
                        }
                    }
                }
            }
        }
    }
    await ws.send(json.dumps(config))

    # Wait for setup confirmation
    response = await ws.recv()
    print(f"Setup confirmed: {response}")

    return ws

The model parameter specifies Gemini 2.0 Flash, which supports multimodal input and audio output on the free tier. The response_modalities setting tells the API to generate audio responses rather than text, creating the voice conversation experience.

Sending Audio and Video Data Through the WebSocket

You need to send audio and video data in the format Gemini expects. Audio goes in small chunks (typically 1024 samples) to minimize latency, while video frames are sent at longer intervals (every 1-2 seconds) to provide visual context without consuming excessive bandwidth.

The data format uses JSON messages with specific structure. Audio data gets base64-encoded and sent with MIME type information:

async def send_audio_chunk(ws, audio_data):
    audio_b64 = base64.b64encode(audio_data).decode('utf-8')
    message = {
        "realtime_input": {
            "media_chunks": [{
                "mime_type": "audio/pcm",
                "data": audio_b64
            }]
        }
    }
    await ws.send(json.dumps(message))

async def send_video_frame(ws, frame_b64):
    message = {
        "realtime_input": {
            "media_chunks": [{
                "mime_type": "image/jpeg",
                "data": frame_b64
            }]
        }
    }
    await ws.send(json.dumps(message))

You'll run these send operations in a loop that continuously captures input from your microphone and webcam. The key is coordinating timing so audio streams constantly while video updates periodically.
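One way to coordinate the two rates is to give audio and video their own coroutines and share a stop signal. The sketch below uses stand-in capture and send functions so it runs without a microphone, webcam, or API key; `audio_loop`, `video_loop`, and `run_demo` are illustrative names, and in a real agent the senders would wrap `send_audio_chunk(ws, ...)` and `send_video_frame(ws, ...)`:

```python
import asyncio
import time


async def audio_loop(read_chunk, send, stop):
    """Forward microphone chunks as fast as they arrive."""
    loop = asyncio.get_running_loop()
    while not stop.is_set():
        # A blocking read (such as a PyAudio stream.read) runs in a worker
        # thread so that it does not stall the event loop.
        chunk = await loop.run_in_executor(None, read_chunk)
        await send(chunk)


async def video_loop(grab_frame, send, stop, interval_s=1.0):
    """Forward one frame per interval, skipping failed captures."""
    while not stop.is_set():
        frame = grab_frame()
        if frame is not None:
            await send(frame)
        try:
            # Sleep for the interval, but wake immediately on shutdown.
            await asyncio.wait_for(stop.wait(), timeout=interval_s)
        except asyncio.TimeoutError:
            pass


async def run_demo(duration_s=0.3):
    """Drive both loops with fake capture and send functions."""
    stop = asyncio.Event()
    sent = {"audio": 0, "video": 0}

    def fake_read_chunk():
        time.sleep(0.01)           # pretend to wait for samples
        return b"\x00" * 2048      # 1024 samples of 16-bit silence

    async def send_audio(chunk):
        sent["audio"] += 1

    async def send_video(frame):
        sent["video"] += 1

    async def stop_later():
        await asyncio.sleep(duration_s)
        stop.set()

    await asyncio.gather(
        audio_loop(fake_read_chunk, send_audio, stop),
        video_loop(lambda: "frame", send_video, stop, interval_s=0.1),
        stop_later(),
    )
    return sent


if __name__ == "__main__":
    print(asyncio.run(run_demo()))
```

Running the blocking microphone read in an executor is the main design choice here: without it, every `stream.read` call would pause video capture and response handling for the duration of the chunk.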

Implementing Interruption Handling for Natural Conversation Flow

Interruption handling is what makes conversations with your agent feel natural rather than robotic. When you start speaking while the AI is talking, the system needs to detect your voice, stop the AI's audio output, and begin processing your new input.

The Gemini Live API handles server-side interruption detection automatically. When it receives new audio input while generating a response, it stops the current generation and begins processing the new input. On the client side, you need to stop playing the AI's audio output as soon as you detect the interruption. Aim to cut playback within 200-300 milliseconds of the user starting to speak; beyond that, the agent audibly talks over the user and the conversation starts to feel laggy.

Here's how to implement client-side interruption detection using audio level monitoring. The approach tracks when your microphone picks up sound above a threshold while the AI is speaking:

import numpy as np

class InterruptionHandler:
    def __init__(self, threshold=500, chunk_size=1024):
        self.threshold = threshold
        self.chunk_size = chunk_size
        self.is_ai_speaking = False
        
    def detect_interruption(self, audio_chunk):
        """Check if user is speaking while AI is talking"""
        if not self.is_ai_speaking:
            return False
            
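        # Calculate the mean absolute level; widen to int32 first so that
        # abs(-32768) does not overflow the int16 range
        audio_array = np.frombuffer(audio_chunk, dtype=np.int16).astype(np.int32)
        audio_level = np.abs(audio_array).mean()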
        
        return audio_level > self.threshold
    
    def start_ai_speech(self):
        self.is_ai_speaking = True
        
    def stop_ai_speech(self):
        self.is_ai_speaking = False

When an interruption is detected, stop audio playback on the client and clear any buffered audio chunks that have not been played yet. This prevents the awkward experience of the AI continuing to talk over you for several seconds after you've started speaking.
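Clearing buffered audio is easiest when playback reads from a queue you control instead of writing chunks straight to the sound card. A minimal sketch, assuming a hypothetical `PlaybackBuffer` class that a playback thread drains with `pop()`:

```python
import threading
from collections import deque


class PlaybackBuffer:
    """Thread-safe FIFO of audio chunks that can be emptied on interruption."""

    def __init__(self):
        self._chunks = deque()
        self._lock = threading.Lock()

    def push(self, chunk: bytes) -> None:
        """Queue a chunk received from the API."""
        with self._lock:
            self._chunks.append(chunk)

    def pop(self):
        """Return the next chunk to play, or None if nothing is queued."""
        with self._lock:
            return self._chunks.popleft() if self._chunks else None

    def flush(self) -> int:
        """Drop everything queued; returns the number of bytes discarded."""
        with self._lock:
            dropped = sum(len(c) for c in self._chunks)
            self._chunks.clear()
            return dropped
```

On interruption you would call `flush()` and mark the AI as no longer speaking, so that only the chunk currently being played can still be heard.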

The threshold value of 500 works for typical environments but you may need to adjust it based on background noise levels. In quiet environments, 300-400 works better. In noisy environments, increase to 700-1000 to avoid false interruptions. Testing with real usage patterns helps you find the right balance, and honestly, getting this tuning right makes more difference to user experience than almost any other parameter.
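Rather than picking the threshold by hand, you could measure the room for a second or two at startup and derive it. The sketch below is one possible heuristic, not a tuned algorithm: the function names and the multiplier of 3 are our assumptions, while the 300 and 1000 bounds come from the ranges suggested above. It assumes 16-bit mono chunks in native byte order:

```python
from array import array


def mean_abs_level(chunk: bytes) -> float:
    """Mean absolute amplitude of a 16-bit PCM chunk."""
    samples = array("h")
    samples.frombytes(chunk)
    if not samples:
        return 0.0
    return sum(abs(s) for s in samples) / len(samples)


def calibrate_threshold(ambient_chunks, factor=3.0, floor=300.0, ceiling=1000.0):
    """Set the threshold to a multiple of ambient noise, clamped to a sane range."""
    levels = [mean_abs_level(c) for c in ambient_chunks]
    if not levels:
        return floor
    ambient = sum(levels) / len(levels)
    return max(floor, min(ceiling, ambient * factor))
```

The result can be passed as the `threshold` argument when constructing the interruption handler.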

Processing Responses and Synthesizing Audio Output

The Gemini Live API returns responses as a stream of JSON messages. Each message can contain different types of data: text transcriptions of what you said, text of what the AI is saying, or audio chunks of the AI's spoken response.

You need to handle these message types differently. Audio chunks should be played immediately to maintain conversational flow. Text transcriptions can be displayed on screen for accessibility or debugging. Here's the response handling structure:

async def handle_responses(ws, audio_output_stream):
    async for message in ws:
        data = json.loads(message)
        
        if "serverContent" in data:
            content = data["serverContent"]
            
            # Handle audio output
            if "modelTurn" in content:
                for part in content["modelTurn"].get("parts", []):
                    if "inlineData" in part:
                        audio_b64 = part["inlineData"]["data"]
                        audio_bytes = base64.b64decode(audio_b64)
                        
                        # Play audio chunk
                        audio_output_stream.write(audio_bytes)
                        
            # turnComplete marks the end of the model's current response
            if "turnComplete" in content:
                print("Turn complete")
                
        elif "toolCall" in data:
            # Handle function calling if implemented
            pass
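Separating parsing from playback makes the handler easier to test. The sketch below pulls the same fields out of a message as the handler above and also reads an `interrupted` flag; that field name is our assumption about the server message format and should be checked against the current Live API reference. `parse_server_message` is a hypothetical helper:

```python
import base64
import json


def parse_server_message(raw):
    """Extract audio chunks and turn flags from one server message."""
    data = json.loads(raw)
    content = data.get("serverContent", {})
    audio = [
        base64.b64decode(part["inlineData"]["data"])
        for part in content.get("modelTurn", {}).get("parts", [])
        if "inlineData" in part
    ]
    return {
        "audio": audio,
        "turn_complete": bool(content.get("turnComplete")),
        # Assumed field: set by the server when the user cut the model off
        "interrupted": bool(content.get("interrupted")),
    }
```

A handler built on this can play each entry in `audio`, and discard any queued playback when `interrupted` is true.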

Audio output requires a separate PyAudio output stream. The format is again 16-bit PCM, mono, but the Live API returns audio at 24kHz rather than the 16kHz used for input, so open the output stream at 24kHz. With the rate set correctly, you can write the returned chunks directly to the stream without resampling:

OUTPUT_SAMPLE_RATE = 24000  # Live API audio output rate; input stays at 16kHz

output_stream = audio.open(
    format=FORMAT,
    channels=CHANNELS,
    rate=OUTPUT_SAMPLE_RATE,
    output=True,
    frames_per_buffer=CHUNK_SIZE
)

Buffer management is important here. If you write audio chunks too slowly, you'll hear gaps and stuttering. If your processing can't keep up with the incoming audio rate, you'll accumulate latency. Monitoring the output buffer fill level helps you detect and handle these issues.
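A simple way to monitor fill level is to convert the number of queued bytes into milliseconds of audio and compare it against two limits. In this sketch the function names and the 50ms and 500ms limits are illustrative assumptions, and `sample_rate` should be whatever rate your output stream is opened with:

```python
def buffered_ms(queued_bytes: int, sample_rate: int,
                bytes_per_sample: int = 2, channels: int = 1) -> float:
    """Milliseconds of audio represented by the queued bytes."""
    return 1000.0 * queued_bytes / (sample_rate * bytes_per_sample * channels)


def buffer_state(queued_bytes: int, sample_rate: int,
                 low_ms: float = 50.0, high_ms: float = 500.0) -> str:
    """Classify the playback queue as starving, healthy, or falling behind."""
    ms = buffered_ms(queued_bytes, sample_rate)
    if ms < low_ms:
        return "underrun-risk"   # gaps and stuttering are likely
    if ms > high_ms:
        return "lagging"         # playback is trailing the model noticeably
    return "ok"
```

Logging the state once per second during testing is usually enough to tell whether stutter comes from the network or from your own processing.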

Complete Implementation: The Full Agent Loop

Bringing all components together requires coordinating three concurrent tasks: capturing and sending input, receiving and playing responses, and monitoring for interruptions. Python's asyncio provides the tools to run these tasks simultaneously.

The main loop structure uses asyncio.gather to run the input and output handlers concurrently. This pattern ensures audio streaming continues smoothly while video frames are captured and sent at their slower rate. At 16kHz with 1024-sample chunks, the loop sends about 16 audio chunks (roughly 32KB of raw audio) per second alongside 1-2 JPEG frames per second, so audio messages outnumber video messages by roughly ten to one.

Here's the complete agent implementation that ties everything together:

import asyncio
import pyaudio
import cv2
import base64
import json
import websockets
import os
from io import BytesIO
from PIL import Image

class MultimodalAgent:
    def __init__(self):
        self.api_key = os.environ.get('GEMINI_API_KEY')
        self.ws_url = f"wss://generativelanguage.googleapis.com/ws/google.ai.generativelanguage.v1alpha.GenerativeService.BidiGenerateContent?key={self.api_key}"
        
        # Audio setup
        self.audio = pyaudio.PyAudio()
        self.sample_rate = 16000
        self.chunk_size = 1024
        
        self.input_stream = self.audio.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=self.sample_rate,
            input=True,
            frames_per_buffer=self.chunk_size
        )
        
        # The Live API returns audio at 24kHz, so the output stream uses
        # that rate rather than the 16kHz input rate
        self.output_stream = self.audio.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=24000,
            output=True,
            frames_per_buffer=self.chunk_size
        )
        
        # Video setup
        self.cap = cv2.VideoCapture(0)
        self.cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
        self.cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)
        
        self.running = True
        self.is_ai_speaking = False
        
    async def setup_connection