
Whisper is a leading open-source transcription model released by OpenAI. In this post, we'll provide a quick overview of why WebSockets make sense for real-time speech transcription through Whisper V3 and walk through a complete implementation example on Baseten.
Traditional HTTP request-based inference creates significant friction for streaming applications due to per-request connection overhead and latency. WebSockets avoid this by maintaining a persistent, bidirectional connection between client and server, enabling seamless real-time communication. Rather than requiring new headers and a new connection for each request, a WebSocket establishes a single channel for continuous data exchange.
This makes WebSockets ideal for real-time applications like AI transcription, live phone calls, and agentic interactions; clients can stream audio chunks while simultaneously receiving transcription results without interruption.
Let’s step through a code example that streams audio from your laptop microphone and transcribes your voice in real time, an approach that adapts to any production web application with a Python backend.
Step 1: Deploy Whisper Streaming Large V3
Baseten created a custom implementation of Whisper with support for streaming over WebSockets. Start by deploying this implementation from the model library.
After the model appears in your account, save the Baseten model ID.
Then, create a Baseten API key from the API keys page (if you don’t already have one) and save it as an environment variable:
export BASETEN_API_KEY=my_api_key
Step 2: Understanding WebSocket streaming in practice
Let’s walk through the code we will run to transcribe our voice. You can find the complete code here, so you don’t need to copy-paste the following chunks one by one.
We first need to install the required packages for using WebSockets, capturing audio, and converting data. You can paste this into your terminal:
pip install websockets sounddevice numpy
Paste the model ID of your deployed Whisper model into the model_id variable. We also define metadata for the transcription session, which is sent to the endpoint when the connection is first established.
import asyncio
import websockets
import sounddevice as sd
import numpy as np
import json
import os

# Audio config
SAMPLE_RATE = 16000
CHUNK_SIZE = 512
CHUNK_DURATION = CHUNK_SIZE / SAMPLE_RATE
CHANNELS = 1

headers = {"Authorization": f"Api-Key {os.getenv('BASETEN_API_KEY')}"}
model_id = "BASETEN_MODEL_ID"

metadata = {
    "vad_params": {
        "threshold": 0.5,
        "min_silence_duration_ms": 300,
        "speech_pad_ms": 30
    },
    "streaming_whisper_params": {
        "encoding": "pcm_s16le",
        "sample_rate": 16000,
        "enable_partial_transcripts": True,
        "audio_language": "en"
    }
}
Now for the meat of the code, which we’ll break down section by section. We start an asynchronous event loop, connect to the Whisper server running on Baseten with websockets.connect, and send the Whisper configuration via metadata.
async def stream_microphone_audio(ws_url):
    loop = asyncio.get_running_loop()

    async with websockets.connect(ws_url, additional_headers=headers) as ws:
        print("Connected to server")
        await ws.send(json.dumps(metadata))
        print("Sent metadata to server")
We initialize a queue for the audio data and create a callback function that fires whenever the microphone captures audio, converting the input samples to 16-bit integers and putting the resulting bytes on the queue in a thread-safe way. This queue acts as a buffer between real-time audio capture and WebSocket transmission, allowing the two processes to run independently.
        send_queue = asyncio.Queue()

        def audio_callback(indata, frames, time_info, status):
            if status:
                print(f"Audio warning: {status}")
            int16_data = (indata * 32767).astype(np.int16).tobytes()
            loop.call_soon_threadsafe(send_queue.put_nowait, int16_data)
We start the input stream with sounddevice using the with keyword, Python’s context manager, for graceful cleanup.
        with sd.InputStream(
            samplerate=SAMPLE_RATE,
            blocksize=CHUNK_SIZE,
            channels=CHANNELS,
            dtype="float32",
            callback=audio_callback,
        ):
            print("Streaming mic audio...")
Inside it, we define two async functions. send_audio continuously pulls new audio off the queue and sends it to the server in chunks, indefinitely. receive_server_messages continuously receives transcribed text over the WebSocket and prints it. We use asyncio.gather to keep both functions running concurrently.
            async def send_audio():
                while True:
                    chunk = await send_queue.get()
                    await ws.send(chunk)

            async def receive_server_messages():
                while True:
                    response = await ws.recv()
                    try:
                        message = json.loads(response)
                        is_final = message.get("is_final")
                        transcript = message.get("transcript")
                        if is_final is True:
                            print(f"[final] {transcript}")
                        elif is_final is False:
                            print(f"[partial] {transcript}")
                        else:
                            print(f"[unknown type] {message}")
                    except Exception as e:
                        print("Non-JSON message or parse error:", response, "| Error:", str(e))

            # Run send + receive tasks concurrently
            await asyncio.gather(send_audio(), receive_server_messages())


ws_url = f"wss://model-{model_id}.api.baseten.co/environments/production/websocket"
asyncio.run(stream_microphone_audio(ws_url))
Step 3: Running the script
In your terminal, paste all the parts of the code above into a file named websocket.py and run python websocket.py. Start talking into the mic and you’ll see partial transcriptions streamed back to you in real time!
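The output will look roughly like the following. This is an illustrative example, not captured output; the exact partial segments depend on your speech, the chunk size, and the VAD settings.

Connected to server
Sent metadata to server
Streaming mic audio...
[partial] testing real-time
[partial] testing real-time transcription with
[final] Testing real-time transcription with Whisper.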

Streaming speech-to-text in production
You can run real-time transcription with Whisper by concurrently sending audio data to and receiving text data from an endpoint with our streaming Whisper implementation. With WebSockets, each user maintains a connection with the server for the entire duration of their session. In production, it’s important to configure appropriate autoscaling settings so that you can support more concurrent users as the number of active connections grows.
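In a production web app, you typically don’t connect browsers to the model endpoint directly; instead, a thin backend relay keeps your API key server-side and forwards audio and transcripts in both directions. Here is a minimal sketch of such a relay using FastAPI. The route name, environment variable names, and message handling are illustrative assumptions, not part of Baseten’s API.

import asyncio
import json
import os

import websockets
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

# Assumed env var: set BASETEN_WS_URL to the same wss://model-<id>... URL used above
BASETEN_WS_URL = os.getenv("BASETEN_WS_URL")
HEADERS = {"Authorization": f"Api-Key {os.getenv('BASETEN_API_KEY')}"}

# Same transcription config as in step 2
METADATA = {
    "streaming_whisper_params": {
        "encoding": "pcm_s16le",
        "sample_rate": 16000,
        "enable_partial_transcripts": True,
        "audio_language": "en",
    }
}

@app.websocket("/transcribe")
async def transcribe(client_ws: WebSocket):
    # One upstream model connection per browser session
    await client_ws.accept()
    async with websockets.connect(BASETEN_WS_URL, additional_headers=HEADERS) as model_ws:
        await model_ws.send(json.dumps(METADATA))

        async def forward_audio():
            # Relay raw PCM chunks from the browser to the model
            while True:
                chunk = await client_ws.receive_bytes()
                await model_ws.send(chunk)

        async def forward_transcripts():
            # Relay transcript messages from the model back to the browser
            while True:
                message = await model_ws.recv()
                await client_ws.send_text(message)

        try:
            await asyncio.gather(forward_audio(), forward_transcripts())
        except (WebSocketDisconnect, websockets.exceptions.ConnectionClosed):
            pass

Because each active browser session holds one long-lived connection to the model, autoscaling on concurrent connections (rather than request rate) is what keeps this pattern healthy under load.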
Whatever you’re transcribing, run it on Baseten with production-grade real-time speech-to-text inference with Whisper V3 on WebSockets.