How can I simulate real-time streaming transcription using OpenAI API? #2307
Replies: 2 comments 2 replies
You're on the right path: emulating streaming by chunking audio is the best workaround available at the moment, since OpenAI's whisper-1 API only supports batch processing, not streaming. Let me describe it for you.
Method: Chunked Streaming Simulation
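As a minimal sketch of the idea (assuming a pre-recorded WAV file; `split_wav_chunks` is a made-up helper name, and `chunk_seconds` is a tunable latency/accuracy trade-off), you split the audio into fixed-length chunks and send each one to the batch endpoint in order:

```python
import io
import wave

def split_wav_chunks(wav_bytes: bytes, chunk_seconds: int = 5) -> list:
    """Split a WAV file into fixed-length chunks, each a standalone WAV buffer.
    (Hypothetical helper for illustration; not an OpenAI API function.)"""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        params = wf.getparams()
        frames_per_chunk = params.framerate * chunk_seconds
        chunks = []
        while True:
            frames = wf.readframes(frames_per_chunk)
            if not frames:
                break
            buf = io.BytesIO()
            with wave.open(buf, "wb") as out:
                out.setnchannels(params.nchannels)
                out.setsampwidth(params.sampwidth)
                out.setframerate(params.framerate)
                out.writeframes(frames)
            buf.seek(0)
            buf.name = "chunk.wav"  # filename hint so the API can detect the format
            chunks.append(buf)
    return chunks

# Each chunk then goes to the regular batch endpoint, in order:
#   for chunk in split_wav_chunks(audio_bytes):
#       text = client.audio.transcriptions.create(model="whisper-1", file=chunk).text
```

The partial transcripts arrive one chunk at a time, which is what gives the streaming-like feel.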
If you're interested, I can help you set up a full real-time transcription pipeline.
Chunked audio is indeed the best approach for simulating streaming transcription with Whisper, but there are some important details to get right.

Working approach: overlapping chunks

```python
import asyncio
import io
import wave

import numpy as np
import sounddevice as sd
from openai import AsyncOpenAI

client = AsyncOpenAI()

SAMPLE_RATE = 16000
CHUNK_DURATION = 3  # seconds per chunk
OVERLAP = 0.5       # overlap between chunks to avoid cutting words

async def transcribe_chunk(audio_bytes: bytes) -> str:
    buf = io.BytesIO(audio_bytes)
    buf.name = "chunk.wav"  # Whisper needs a filename hint to detect the format
    transcript = await client.audio.transcriptions.create(
        model="whisper-1",
        file=buf,
        language="en",
    )
    return transcript.text

async def stream_transcription():
    chunk_samples = int(SAMPLE_RATE * CHUNK_DURATION)
    overlap_samples = int(SAMPLE_RATE * OVERLAP)
    buffer = np.array([], dtype=np.float32)

    def audio_callback(indata, frames, time, status):
        nonlocal buffer
        buffer = np.append(buffer, indata[:, 0])

    with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, callback=audio_callback):
        while True:
            await asyncio.sleep(CHUNK_DURATION - OVERLAP)
            if len(buffer) < chunk_samples:
                continue

            # Take a chunk, keeping the tail as overlap for the next one
            chunk = buffer[:chunk_samples]
            buffer = buffer[chunk_samples - overlap_samples:]

            # Convert float32 samples to 16-bit PCM WAV bytes
            wav_buf = io.BytesIO()
            with wave.open(wav_buf, "wb") as wf:
                wf.setnchannels(1)
                wf.setsampwidth(2)
                wf.setframerate(SAMPLE_RATE)
                wf.writeframes((chunk * 32767).astype(np.int16).tobytes())

            text = await transcribe_chunk(wav_buf.getvalue())
            if text.strip():
                print(text, end=" ", flush=True)

asyncio.run(stream_transcription())
```

Key improvements over naive chunking: the chunks overlap so words at boundaries aren't cut in half, the `BytesIO` buffer gets a filename hint so the API accepts it, and empty transcriptions are skipped. Adding a VAD (voice activity detection) step to skip silent chunks would cut API costs further.
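One caveat with overlapping chunks: the overlap window gets transcribed twice, so consecutive results can repeat a few words. A rough way to handle that (a sketch, not part of the code above; `merge_overlap` and `max_words` are names I'm making up here) is to trim the longest word-level overlap between the previous transcript's tail and the new transcript's head:

```python
def merge_overlap(prev: str, new: str, max_words: int = 8) -> str:
    """Return `new` with any word-level suffix of `prev` that it repeats removed.
    (Naive sketch: exact case-insensitive word match only.)"""
    prev_words = prev.lower().split()
    new_words = new.split()
    lowered = [w.lower() for w in new_words]
    # Try the longest candidate overlap first, then shrink.
    for n in range(min(max_words, len(prev_words), len(new_words)), 0, -1):
        if prev_words[-n:] == lowered[:n]:
            return " ".join(new_words[n:])
    return new
```

This is brittle when Whisper transcribes the overlap differently each time (punctuation, alternate spellings), so fuzzier matching may be needed in practice.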
Alternative: OpenAI Realtime API

If you need true real-time transcription (not simulated), OpenAI's Realtime API supports audio streaming natively via WebSockets. It's a different endpoint and pricing model, but it gives you actual incremental transcription:

```python
# The Realtime API uses a WebSocket, not the REST transcription endpoint
# See: https://platform.openai.com/docs/guides/realtime
```

There's no official roadmap for adding streaming to the Whisper REST API specifically, but the Realtime API is effectively OpenAI's answer to that use case.
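To give a flavor of the wire format (hedged sketch: the `input_audio_buffer.append` event type and base64-encoded PCM16 audio are my reading of the Realtime docs; `ws` and `frame` are assumed names for an open WebSocket and a raw audio buffer), each captured audio frame gets wrapped in a JSON event and sent over the socket:

```python
import base64
import json

def audio_append_event(pcm16_bytes: bytes) -> str:
    """Wrap one raw PCM16 audio frame in a Realtime API append event.
    (Sketch only; check the Realtime docs for the authoritative event schema.)"""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm16_bytes).decode("ascii"),
    })

# With an open WebSocket `ws` to the Realtime endpoint you would send:
#   ws.send(audio_append_event(frame))
```

The server then pushes incremental transcription events back over the same socket, which is what makes it genuinely real-time rather than chunk-simulated.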
I'm working on a project where I want to convert speech to text in real-time using OpenAI's Whisper model. I see that Whisper's hosted API (whisper-1) currently only supports batch mode — sending a full audio file and receiving the full transcript.
I'm trying to achieve a streaming-like transcription experience, where I can start receiving partial transcriptions as audio is still being recorded or uploaded.
Is there a way to simulate streaming transcription using Whisper?
I'm using Python.
I considered chunking the audio into small parts and sending them sequentially.
Is that the best approach, or is there a better method?
Also, is there any public roadmap or timeline for when the official OpenAI Whisper API might support real-time streaming transcription?
Thanks in advance!