Add voice mode with speech-to-text and text-to-speech #368
2witstudios wants to merge 3 commits into master from
Conversation
- Add voice mode Zustand store for state management
- Add /api/voice/transcribe endpoint using OpenAI Whisper
- Add /api/voice/synthesize endpoint using OpenAI TTS
- Add useVoiceMode hook for audio recording and playback
- Add VoiceModeOverlay with tap-to-speak and barge-in modes
- Add VoiceModeSettings for voice/speed configuration
- Add voice mode toggle to InputFooter (requires OpenAI API key)
- Integrate with GlobalAssistantView chat flow

The base AI model remains the user's selected model - voice mode only handles input/output via STT/TTS.

https://claude.ai/code/session_0126CHZ5h1Gnv5kKT4TaUFK2
- Add voice mode support to AiChatView (Page AI Chat)
- Add voice mode support to SidebarChatTab (Sidebar AI Assistant)
- Both now show voice mode button when OpenAI is configured
- Voice transcripts send through existing chat flows

https://claude.ai/code/session_0126CHZ5h1Gnv5kKT4TaUFK2
📝 Walkthrough

This PR introduces comprehensive voice mode functionality to the application, adding text-to-speech and speech-to-text capabilities via OpenAI APIs, a React hook for voice state management, a Zustand store for shared voice mode state, interactive UI components for voice interaction, and integration points across multiple chat interfaces.
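The store and hook themselves live in the PR diff rather than this thread. As a rough orientation only, a shared voice-mode store along the following lines would match the names used later in the review (`useVoiceModeStore`, `VoiceState`, `TTSVoice`); the specific state values, fields, and defaults are assumptions.

```ts
// Illustrative sketch only - the real useVoiceModeStore is part of the PR diff,
// not shown in this thread. Type names come from the review below; the state
// values, fields, and defaults here are assumptions.
import { create } from 'zustand';

export type VoiceState = 'idle' | 'listening' | 'transcribing' | 'speaking'; // assumed union
export type TTSVoice = 'alloy' | 'echo' | 'fable' | 'onyx' | 'nova' | 'shimmer'; // assumed voice set

interface VoiceModeStore {
  isVoiceModeOpen: boolean;
  voiceState: VoiceState;
  voice: TTSVoice;
  speed: number;
  setVoiceState: (state: VoiceState) => void;
  setVoice: (voice: TTSVoice) => void;
  setSpeed: (speed: number) => void;
  toggleVoiceMode: () => void;
}

export const useVoiceModeStore = create<VoiceModeStore>((set) => ({
  isVoiceModeOpen: false,
  voiceState: 'idle',
  voice: 'alloy',
  speed: 1.0,
  setVoiceState: (voiceState) => set({ voiceState }),
  setVoice: (voice) => set({ voice }),
  setSpeed: (speed) => set({ speed }),
  toggleVoiceMode: () => set((s) => ({ isVoiceModeOpen: !s.isVoiceModeOpen })),
}));
```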
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant VoiceOverlay as VoiceModeOverlay
    participant Hook as useVoiceMode Hook
    participant Transcribe as /api/voice/transcribe
    participant Whisper as OpenAI Whisper
    participant Chat as Chat Component
    participant Synthesize as /api/voice/synthesize
    participant TTS as OpenAI TTS
    participant Audio as Web Audio API

    User->>VoiceOverlay: Tap mic / Press space to start
    VoiceOverlay->>Hook: startListening()
    Hook->>Hook: Activate MediaRecorder
    activate Hook
    User->>Audio: Speak into microphone
    Audio->>Hook: Capture audio data
    deactivate Hook
    User->>VoiceOverlay: Stop speaking (release or timeout)
    VoiceOverlay->>Hook: stopListening()
    Hook->>Transcribe: POST audio file
    Transcribe->>Whisper: Forward audio with API key
    Whisper-->>Transcribe: Transcription text
    Transcribe-->>Hook: Return JSON with transcript
    Hook->>Hook: Process transcript
    Hook->>VoiceOverlay: Display transcript
    VoiceOverlay->>VoiceOverlay: onSend(transcript)
    VoiceOverlay->>Chat: Send voice message
    Chat->>Chat: Generate AI response
    Chat->>Hook: Trigger TTS with aiResponse
    Hook->>Synthesize: POST text + voice settings
    Synthesize->>TTS: Request MP3 audio
    TTS-->>Synthesize: MP3 stream
    Synthesize-->>Hook: Return audio stream
    Hook->>Audio: Play synthesis via Web Audio API
    Audio->>User: Hear TTS response
    Hook->>Hook: onSpeakComplete()
    Hook->>VoiceOverlay: Resume listening (barge-in mode)
```
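To make one round trip of this flow concrete, here is a minimal client-side sketch of capture → transcribe → synthesize. It is not the actual useVoiceMode implementation: only the endpoint paths and the `audio` form field come from this PR; the `{ text }` response shape, the fixed 5-second stop, and the plain `Audio` playback are illustrative assumptions.

```ts
// Minimal sketch of the capture -> transcribe -> synthesize loop. The real hook
// is more involved (state machine, barge-in, silence detection, cleanup).

async function recordOnce(stream: MediaStream): Promise<Blob> {
  const recorder = new MediaRecorder(stream);
  const chunks: BlobPart[] = [];
  recorder.ondataavailable = (e) => chunks.push(e.data);

  return new Promise((resolve) => {
    recorder.onstop = () => resolve(new Blob(chunks, { type: recorder.mimeType }));
    recorder.start();
    // Stop after 5 s for the sketch; the real hook stops on release or timeout.
    setTimeout(() => recorder.stop(), 5000);
  });
}

async function transcribe(audioBlob: Blob): Promise<string> {
  const formData = new FormData();
  formData.append('audio', audioBlob, 'recording.webm');
  const res = await fetch('/api/voice/transcribe', { method: 'POST', body: formData });
  const { text } = await res.json(); // assumed response shape
  return text;
}

async function speak(text: string): Promise<void> {
  const res = await fetch('/api/voice/synthesize', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text }), // voice/speed settings omitted in this sketch
  });
  const audio = new Audio(URL.createObjectURL(await res.blob()));
  await audio.play();
}
```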
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed, ❌ 1 failed (1 warning)
Code review: No issues found. Checked for bugs and CLAUDE.md compliance. 🤖 Generated with Claude Code
Actionable comments posted: 11
🤖 Fix all issues with AI agents
In `@apps/web/src/app/api/voice/synthesize/route.ts`:
- Line 17: AUTH_OPTIONS currently disables CSRF/origin checks for
session-authenticated POSTs which allows cross-site requests to trigger TTS;
change AUTH_OPTIONS to requireCSRF: true (e.g., const AUTH_OPTIONS = { allow:
['session'] as const, requireCSRF: true }) and ensure the POST route handler
(route.ts POST handler) enforces CSRF validation and/or explicit Origin/Referer
header checks for same-site requests before using the user API key; if your auth
middleware exposes a CSRF check function, call it at the start of the handler
(or add explicit origin validation) to block cross-site POSTs.
- Around line 101-103: The code clamps `speed` into `clampedSpeed` without
ensuring `speed` is a finite number, which yields NaN for non-numeric input;
update the logic around the `speed` variable and `clampedSpeed` so you first
coerce/validate `speed` (e.g., parse/Number conversion) and verify
Number.isFinite(value) before clamping, and if invalid use a safe default (e.g.,
1.0) or return a 4xx error; modify the block that computes `clampedSpeed` in
route.ts so it checks finiteness and falls back to a valid number prior to
Math.min/Math.max.
- Around line 104-118: Replace the direct fetch call in route.ts with the Vercel
AI SDK TTS helper: import experimental_generateSpeech (alias generateSpeech)
from 'ai' and use openai.speech('tts-1') as the model; call generateSpeech({
model: openai.speech('tts-1'), text, voice, providerOptions: { openai: {
response_format: 'mp3', speed: clampedSpeed } } }) and extract audio, then
return new Response(audio.uint8Array, { headers: { 'Content-Type':
audio.mediaType || 'audio/mpeg' } }); ensure you remove the manual fetch and
keep using the existing variables model/text/voice/clampedSpeed where
appropriate.
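Taken together, the first two synthesize-route fixes might look roughly like the sketch below. The `AUTH_OPTIONS` shape is lifted from the comment above; how the repo's auth middleware consumes it is assumed, and the 0.25–4.0 range simply mirrors OpenAI's documented TTS speed bounds.

```ts
// Sketch under the assumptions noted above - not the actual route implementation.
const AUTH_OPTIONS = { allow: ['session'] as const, requireCSRF: true };

// Coerce and validate the requested speed before clamping, so non-numeric
// input falls back to 1.0 instead of propagating NaN to the TTS request.
function clampSpeed(speed: unknown): number {
  const value = Number(speed);
  if (!Number.isFinite(value)) return 1.0;
  return Math.min(4.0, Math.max(0.25, value));
}
```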
In `@apps/web/src/app/api/voice/transcribe/route.ts`:
- Line 16: The route currently disables CSRF/origin checks via AUTH_OPTIONS;
change AUTH_OPTIONS to requireCSRF: true and keep session auth, and add an
explicit origin/referrer verification inside the POST handler (exported POST
function) to ensure the request Origin/Referer matches your app's allowed
origins (reject requests when header missing or mismatched). Update any related
tests or callers to include the CSRF token or proper origin header and ensure
the session-auth flow still obtains/validates the CSRF token before accepting
the POST.
- Around line 8-9: The comment in route.ts claims a fallback to checking an
OpenRouter key but the implementation only checks OpenAI; either update the
comment to remove the OpenRouter fallback mention or implement the fallback
logic: locate the API key lookup in the request handler (the code that currently
checks for the OpenAI key), and add a secondary check for an OpenRouter key name
(e.g., OPENROUTER_API_KEY or process.env.OPENROUTER_API_KEY) and use it where
appropriate; ensure the comment text referencing "OpenRouter fallback" is
adjusted to match the chosen approach.
- Around line 96-103: Replace the direct fetch call to OpenAI Whisper with the
Vercel AI SDK transcription API: remove the fetch block and call the SDK's
transcribe method (openai.transcription('whisper-1') / transcribe()) using the
same multipart form or file stream, handle the returned transcription result and
errors via the SDK's response, and ensure you pass the API key/config through
the SDK client initialization used elsewhere in this file (refer to
transcribe(), openai.transcription('whisper-1') and the surrounding route
handler in route.ts to locate where to swap the logic).
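If the transcribe route is migrated to the AI SDK as suggested, the handler core could look like the sketch below. `experimental_transcribe` and `openai.transcription('whisper-1')` are named in the review; the origin check, the `NEXT_PUBLIC_APP_URL` variable, the `audio` form field, and the error handling are illustrative assumptions, and the real route resolves the API key from the user's stored settings rather than an env var.

```ts
import { experimental_transcribe as transcribe } from 'ai';
import { createOpenAI } from '@ai-sdk/openai';

export async function POST(request: Request) {
  // Explicit origin verification in addition to CSRF, as the review suggests.
  const origin = request.headers.get('origin');
  const allowedOrigin = process.env.NEXT_PUBLIC_APP_URL; // assumed env var
  if (!origin || !allowedOrigin || !origin.startsWith(allowedOrigin)) {
    return Response.json({ error: 'Invalid origin' }, { status: 403 });
  }

  const formData = await request.formData();
  const audioFile = formData.get('audio') as File | null;
  if (!audioFile) {
    return Response.json({ error: 'Missing audio file' }, { status: 400 });
  }

  // Assumed key source for the sketch; the real route uses the user's OpenAI settings.
  const openai = createOpenAI({ apiKey: process.env.OPENAI_API_KEY });

  const result = await transcribe({
    model: openai.transcription('whisper-1'),
    audio: new Uint8Array(await audioFile.arrayBuffer()),
  });

  return Response.json({ text: result.text });
}
```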
In
`@apps/web/src/components/layout/middle-content/page-views/ai-page/AiChatView.tsx`:
- Around line 371-405: handleVoiceSend is sending voice transcripts as { text }
which doesn't match the expected message shape; change the payload passed to
sendMessage inside handleVoiceSend to use the message parts schema: send message
content as { parts: [{ type: 'text', text }] } so extractMessageContent() can
parse it. Locate handleVoiceSend in AiChatView.tsx and update the first argument
to sendMessage from { text } to the parts object, ensuring other metadata
(chatId, conversationId, selectedProvider, etc.) remains unchanged.
In
`@apps/web/src/components/layout/middle-content/page-views/dashboard/GlobalAssistantView.tsx`:
- Around line 601-625: The voice handler handleVoiceSend is calling sendMessage
with a raw { text } payload which bypasses the required multipart structure;
change the message content to the parts format—call sendMessage with { parts: [{
type: 'text', text }] } and keep the existing requestBody (the second arg)
unchanged so downstream consumers receive the expected parts structure (update
the sendMessage invocation in handleVoiceSend accordingly).
In
`@apps/web/src/components/layout/right-sidebar/ai-assistant/SidebarChatTab.tsx`:
- Around line 585-613: handleVoiceSend currently calls sendMessage with a plain
{ text } payload which bypasses the required multipart message format; update
handleVoiceSend to call sendMessage with a parts payload instead (e.g. { parts:
[{ type: 'text', text }] }) while keeping the existing body construction logic
intact so downstream consumers receive the message parts structure; locate
handleVoiceSend and replace the first argument passed to sendMessage
accordingly, ensuring any other fields (isReadOnly, webSearchEnabled,
selectedProvider, etc.) remain unchanged.
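All three chat views (AiChatView, GlobalAssistantView, SidebarChatTab) need the same one-line change: wrap the transcript in the message-parts shape instead of a raw `{ text }` payload. A sketch is below; the `SendMessage` type and the standalone `handleVoiceSend` signature are stand-ins for what each view already has, and the second `requestBody` argument is whatever that view already builds.

```ts
// Illustrative only: `sendMessage` stands in for the hook function each chat
// view already uses; the type below is an assumption for the sketch.
type SendMessage = (
  message: { parts: Array<{ type: 'text'; text: string }> },
  requestBody?: Record<string, unknown>,
) => void;

function handleVoiceSend(
  text: string,
  sendMessage: SendMessage,
  requestBody?: Record<string, unknown>,
) {
  // Wrap the voice transcript in the message-parts shape instead of { text }.
  sendMessage({ parts: [{ type: 'text', text }] }, requestBody);
}
```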
In `@apps/web/src/hooks/useVoiceMode.ts`:
- Line 6: Replace the nanoid import and any nanoid() calls with CUID2: change
"import { nanoid } from 'nanoid'" to "import { cuid } from 'cuid2'" (or the
repo's cuid2 export) in useVoiceMode.ts and replace all uses of nanoid() that
generate audio IDs with cuid() so audio IDs follow the project's CUID2 standard
(also update the other occurrence noted at line 335 accordingly).
- Around line 137-141: The uploaded filename is hardcoded to "recording.webm"
while audioBlob.type may be "audio/webm" or "audio/mp4"; update the code that
builds the FormData in useVoiceMode (the formData.append call that uses
audioBlob) to derive the correct file extension from audioBlob.type (e.g., map
"audio/webm" -> ".webm", "audio/mp4" or "audio/mpeg" -> ".mp4") and use that
extension in the filename passed to formData.append('audio', audioBlob,
filename) so the filename matches the actual MIME type.
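Both useVoiceMode fixes are small; a combined sketch follows. Using `@paralleldrive/cuid2` with `createId` is an assumption about which CUID2 package the repo standardizes on, and the MIME-type map only covers the formats mentioned above.

```ts
import { createId } from '@paralleldrive/cuid2'; // assumed CUID2 package/export

// Generate audio IDs with CUID2 instead of nanoid.
const audioId = createId();

// Derive the upload filename extension from the recorded blob's MIME type
// so the filename matches the actual container format.
function filenameFor(audioBlob: Blob): string {
  const extensions: Record<string, string> = {
    'audio/webm': '.webm',
    'audio/mp4': '.mp4',
    'audio/mpeg': '.mp3',
  };
  const base = audioBlob.type.split(';')[0]; // strip codec params, e.g. "audio/webm;codecs=opus"
  return `recording${extensions[base] ?? '.webm'}`;
}

// Usage in the transcription request:
// formData.append('audio', audioBlob, filenameFor(audioBlob));
```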
🧹 Nitpick comments (1)
apps/web/src/hooks/useVoiceMode.ts (1)
27-27: Use `VoiceState` instead of `string` for `voiceState`. This keeps the hook return aligned with the store's union type and improves type safety.
Suggested fix
```diff
-import { useVoiceModeStore, type TTSVoice } from '@/stores/useVoiceModeStore';
+import { useVoiceModeStore, type TTSVoice, type VoiceState } from '@/stores/useVoiceModeStore';
@@
-  voiceState: string;
+  voiceState: VoiceState;
```
```ts
// Call OpenAI TTS API
const response = await fetch('https://api.openai.com/v1/audio/speech', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${openAISettings.apiKey}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model,
    input: text,
    voice,
    speed: clampedSpeed,
    response_format: 'mp3',
  }),
});
```
🧩 Analysis chain
🌐 Web query:
Vercel AI SDK text-to-speech TTS audio support OpenAI
💡 Result:
Vercel’s AI SDK does support OpenAI text-to-speech (TTS) via the experimental generateSpeech() API, using the OpenAI provider’s .speech() model factory. Example: model: openai.speech('tts-1') (or gpt-4o-mini-tts) and it returns an audio payload. [1][2]
Key options (OpenAI provider):
- `voice` (e.g. "alloy") [2]
- Output format via `outputFormat` (SDK) or `providerOptions.openai.response_format` (mp3, opus, aac, flac, wav, pcm) [1][2]
- `providerOptions.openai.speed` (0.25–4.0) [1][2]
- `providerOptions.openai.instructions` (works with gpt-4o-mini-tts, not tts-1/tts-1-hd) [1][2]
What this is not: AI SDK support for chat/completions models generating audio outputs (e.g. “gpt-4o-audio-preview” style audio-out) has been requested and discussed as a separate feature area. [3]
Minimal code sample (AI SDK + OpenAI TTS):
```ts
import { experimental_generateSpeech as generateSpeech } from 'ai';
import { openai } from '@ai-sdk/openai';

const { audio } = await generateSpeech({
  model: openai.speech('tts-1'),
  text: 'Hello, world!',
  voice: 'alloy',
  providerOptions: { openai: { response_format: 'mp3', speed: 1.1 } },
});
```

[1][2]
Sources:
[1] https://ai-sdk.dev/docs/guides/openai
[2] https://v5.ai-sdk.dev/docs/reference/ai-sdk-core/generate-speech
[3] vercel/ai#5007
🌐 Web query:
Vercel AI SDK generateText generateObject audio generation server-side
💡 Result:
generateText() / generateObject() in the Vercel AI SDK are for text and JSON/structured output—not for producing audio bytes. For server-side audio generation (TTS), AI SDK v5 added speech primitives: experimental_generateSpeech (and experimental_transcribe for STT). [1]
Server-side TTS (Next.js Route Handler)
```ts
// app/api/tts/route.ts
import { experimental_generateSpeech as generateSpeech } from 'ai';
import { openai } from '@ai-sdk/openai';

export async function POST(req: Request) {
  const { text } = await req.json();
  const { audio } = await generateSpeech({
    model: openai.speech('tts-1'),
    text,
    voice: 'alloy',
  });
  return new Response(audio.uint8Array, {
    headers: {
      // adjust if you request a different output format from your provider/model
      'Content-Type': 'audio/mpeg',
      'Cache-Control': 'no-store',
    },
  });
}
```

This uses the unified speech API pattern introduced in AI SDK 5. [1]
Alternative provider example (Hume via AI SDK)
Hume documents the same unified interface using experimental_generateSpeech, with model: hume.speech() and returning an audio payload (non-streaming). [2]
Note on “LLM generates audio as an output”
There was (at least as of early 2025) a gap in @ai-sdk/openai where audio input was accepted but configuring audio output from multimodal chat models (e.g., gpt-4o-audio-preview) wasn’t supported directly via the OpenAI provider parameters in AI SDK. [3] If your goal is specifically “chat completion that outputs audio,” check current provider support; otherwise, use generateSpeech() for TTS. [1]
Sources: [1] AI SDK 5 announcement (speech generation & transcription) • [2] Hume’s AI SDK integration guide • [3] Vercel/ai GitHub issue on audio outputs for gpt-4o-audio-preview
🌐 Web query:
"Vercel AI SDK" v2 API reference speech audio
💡 Result:
For Vercel AI SDK “SpeechModelV2” (the v2 speech model interface), the relevant API reference is experimental_generateSpeech / generateSpeech(), which returns an audio: GeneratedAudioFile (your “speech audio” object). (v5.ai-sdk.dev)
generateSpeech() (speech → audio) API reference
- Docs (latest): AI SDK Core: generateSpeech (ai-sdk.dev)
- Docs (AI SDK 5 / shows SpeechModelV2): v5.ai-sdk.dev … generateSpeech (v5.ai-sdk.dev)
What the returned audio contains (GeneratedAudioFile)
- `base64: string` – audio as base64
- `uint8Array: Uint8Array` – raw bytes
- `mediaType: string` – e.g. audio/mpeg
- `format: string` – e.g. mp3

(plus `warnings`, optional `providerMetadata`, and `responses` metadata) (ai-sdk.dev)
“Speech” overview page (how to access audio bytes)
The Speech guide shows accessing result.audio.uint8Array / result.audio.base64. (ai-sdk.dev)
Citations:
- 1: https://v5.ai-sdk.dev/docs/reference/ai-sdk-core/generate-speech?utm_source=openai
- 2: https://ai-sdk.dev/docs/reference/ai-sdk-core/generate-speech
- 3: https://v5.ai-sdk.dev/docs/reference/ai-sdk-core/generate-speech?utm_source=openai
- 4: https://ai-sdk.dev/docs/reference/ai-sdk-core/generate-speech
- 5: https://ai-sdk.dev/docs/ai-sdk-core/speech?utm_source=openai
Refactor to use Vercel AI SDK's experimental_generateSpeech() for TTS.
The Vercel AI SDK supports OpenAI text-to-speech via openai.speech('tts-1') with all options you're using (voice, response_format, speed). Replace the direct fetch with:
```ts
import { experimental_generateSpeech as generateSpeech } from 'ai';
import { openai } from '@ai-sdk/openai';

const { audio } = await generateSpeech({
  model: openai.speech('tts-1'),
  text,
  voice,
  providerOptions: {
    openai: {
      response_format: 'mp3',
      speed: clampedSpeed,
    },
  },
});

return new Response(audio.uint8Array, {
  headers: { 'Content-Type': audio.mediaType || 'audio/mpeg' },
});
```

This aligns with the guideline: Use Vercel AI SDK with async/await for all AI operations.
🤖 Prompt for AI Agents
In `@apps/web/src/app/api/voice/synthesize/route.ts` around lines 104 - 118,
Replace the direct fetch call in route.ts with the Vercel AI SDK TTS helper:
import experimental_generateSpeech (alias generateSpeech) from 'ai' and use
openai.speech('tts-1') as the model; call generateSpeech({ model:
openai.speech('tts-1'), text, voice, providerOptions: { openai: {
response_format: 'mp3', speed: clampedSpeed } } }) and extract audio, then
return new Response(audio.uint8Array, { headers: { 'Content-Type':
audio.mediaType || 'audio/mpeg' } }); ensure you remove the manual fetch and
keep using the existing variables model/text/voice/clampedSpeed where
appropriate.
Acknowledged. The experimental_generateSpeech API is still experimental and may change. We'll consider migrating to the AI SDK TTS in a future iteration once the API stabilizes. For now, the direct fetch approach provides reliable streaming audio playback.
@2witstudios, understood — that's a reasonable approach given the experimental status. The direct fetch implementation works well for your current needs. Feel free to revisit the AI SDK integration when the API stabilizes. 👍
🧠 Learnings used
Learnt from: CR
Repo: 2witstudios/PageSpace PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-24T23:22:50.133Z
Learning: Applies to app/api/**/*.{ts,tsx} : Use `const body = await request.json();` to extract request bodies, `const { searchParams } = new URL(request.url);` for query parameters, and `return Response.json(data)` or `return NextResponse.json(data)` for responses
Learnt from: CR
Repo: 2witstudios/PageSpace PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-23T18:49:41.966Z
Learning: Applies to apps/web/src/app/**/route.{ts,tsx} : In Route Handlers, return JSON using `Response.json(data)` or `NextResponse.json(data)`
Learnt from: CR
Repo: 2witstudios/PageSpace PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-23T18:49:41.966Z
Learning: Applies to apps/web/src/**/*.{ts,tsx} : Use Vercel AI SDK with async/await for all AI operations and streaming
Learnt from: CR
Repo: 2witstudios/PageSpace PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-22T20:04:40.910Z
Learning: Applies to **/*ai*.{ts,tsx} : Use Vercel AI SDK for AI integrations
Learnt from: 2witstudios
Repo: 2witstudios/PageSpace PR: 91
File: apps/web/src/components/ai/ui/Image.tsx:2-2
Timestamp: 2025-12-16T19:06:20.385Z
Learning: In apps/web/src/components/ai/ui/Image.tsx (TypeScript/React), the intentional use of `Experimental_GeneratedImage` from the Vercel AI SDK is accepted. This type is the correct and intended way to handle AI-generated images with base64/mediaType properties, and will be updated when the AI SDK stabilizes this API.
Learnt from: CR
Repo: 2witstudios/PageSpace PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-20T17:23:53.244Z
Learning: Tech stack: Next.js 15 App Router + TypeScript + Tailwind + shadcn/ui (frontend), PostgreSQL + Drizzle ORM (database), Ollama + Vercel AI SDK + OpenRouter + Google AI SDK (AI), custom JWT auth, local filesystem storage, Socket.IO for real-time, Docker deployment
Resolved review threads:

- apps/web/src/components/layout/middle-content/page-views/ai-page/AiChatView.tsx
- apps/web/src/components/layout/middle-content/page-views/dashboard/GlobalAssistantView.tsx (outdated)
- apps/web/src/components/layout/right-sidebar/ai-assistant/SidebarChatTab.tsx
Security fixes:
- Enable CSRF protection on /api/voice/synthesize and /api/voice/transcribe routes
- Add speed input validation to prevent NaN from invalid input

Code quality fixes:
- Use message parts structure for voice transcripts in all chat views
- Replace nanoid with cuid2 for audio ID generation (repo standard)
- Match audio filename extension to actual MIME type
- Remove outdated OpenRouter fallback comment

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Summary
Adds comprehensive voice mode functionality to the chat interface, enabling hands-free interaction through OpenAI's Whisper (speech-to-text) and TTS (text-to-speech) APIs. Users can now have natural voice conversations with the assistant.
Key Changes
API Routes
- /api/voice/transcribe - Converts audio to text using OpenAI Whisper API
- /api/voice/synthesize - Converts text to speech using OpenAI TTS API

Components

- VoiceModeOverlay - Full-screen overlay for voice interaction
- VoiceModeSettings - Configuration panel for voice preferences

Hooks & State Management

- useVoiceMode - Main hook managing voice interaction lifecycle
- useVoiceModeStore - Zustand store for voice mode state

UI Integration
Implementation Details
Two Interaction Modes
Audio Processing
Error Handling
Browser Compatibility
https://claude.ai/code/session_0126CHZ5h1Gnv5kKT4TaUFK2
Summary by CodeRabbit
Release Notes
New Features