pranavjoshi001 commented Dec 12, 2025

Changelog Entry

  • Added Speech-to-Speech (S2S) support for real-time voice conversations, in PR #5654, by @pranavjoshi

Description

This PR introduces Speech-to-Speech (S2S) functionality in Web Chat, enabling real-time voice conversations with bots. The implementation includes audio recording via AudioWorklet, audio playback with buffer queueing, and speech state management. This foundation supports upcoming MMRT (Multi-Modal Real-Time), ABS (Azure Bot Service), and CCV2 integration changes.

Activity structure: see microsoft/Agents#377

Design

The Speech-to-Speech feature is built on three main components:

  1. Voice State Management (voiceActivity reducer) - Manages:

    • voiceState: Current speech state (idle, listening, user_speaking, processing, bot_speaking)
    • voiceHandlers: Registered audio handler functions (supports multiple handlers)
  2. SpeechToSpeech Provider (SpeechToSpeechComposer.tsx) - A React component that manages:

    • VoiceHandlerBridge - Registers audio playback functions (queueAudio, stopAllAudio) with Redux
    • VoiceRecorderBridge - Bridges Redux voice state with microphone recording, sends audio chunks via postVoiceActivity
  3. Exposed control hooks:

    • useVoiceStart.ts - Hook to start an S2S interaction
    • useVoiceStop.ts - Hook to stop an S2S interaction (stops both the microphone and audio playback)
    • useVoiceState.ts - Hook that returns the current state of the voice interaction

Speech State Flow

idle → listening → user_speaking → processing → bot_speaking → listening
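
The flow above can be read as a small transition table. The following is a minimal sketch, not the reducer from this PR: the state names come from the description, while `NEXT_STATE`, `advance`, and `walk` are hypothetical names used only for illustration. It models the documented happy path only; stopping (which returns to idle) and barge-in are left out.

```typescript
// Voice states as named in the PR description.
type VoiceState = 'idle' | 'listening' | 'user_speaking' | 'processing' | 'bot_speaking';

// Hypothetical transition table: each state maps to the next state in the documented flow.
const NEXT_STATE: Record<VoiceState, VoiceState> = {
  idle: 'listening',
  listening: 'user_speaking',
  user_speaking: 'processing',
  processing: 'bot_speaking',
  bot_speaking: 'listening'
};

// Hypothetical helper: advances one step along the documented flow.
function advance(state: VoiceState): VoiceState {
  return NEXT_STATE[state];
}

// Hypothetical helper: walks the flow for a number of steps and returns every state visited.
function walk(start: VoiceState, steps: number): VoiceState[] {
  const visited: VoiceState[] = [start];
  for (let i = 0; i < steps; i++) {
    visited.push(advance(visited[visited.length - 1]));
  }
  return visited;
}
```

Note that after the first turn the loop returns to listening rather than idle, so the conversation stays open until it is explicitly stopped.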

Voice Activity Flow (Fire-and-Forget Pattern)

Outgoing (User → Bot):

  • User speech captured via AudioWorklet → postVoiceActivity action → postVoiceActivitySaga → DirectLine (no Redux storage)

Incoming (Bot → User):

  • DirectLine activity$ → observeActivitySaga → calls queueAudio() on the registered voice handlers directly (no Redux storage)
  • Only transcript activities go through the standard activity pipeline for rendering
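
The incoming routing described above could look roughly like the sketch below. It is an illustration under assumptions, not the code in observeActivitySaga.ts: the `IncomingActivity` shape and its `kind` field are hypothetical stand-ins (the real activity structure is in microsoft/Agents#377), `routeIncoming` is a made-up name, and only `queueAudio` and `stopAllAudio` are taken from the description.

```typescript
// Hypothetical, simplified activity shapes; the real structure is in microsoft/Agents#377.
type IncomingActivity =
  | { kind: 'voice-chunk'; audio: string }
  | { kind: 'transcript'; text: string };

// Handler shape assumed from the function names in the description.
interface VoiceHandler {
  queueAudio(audio: string): void;
  stopAllAudio(): void;
}

// Hypothetical router: audio chunks go straight to the handlers,
// everything else is appended to the pipeline that feeds rendering.
function routeIncoming(
  activity: IncomingActivity,
  handlers: ReadonlyMap<string, VoiceHandler>,
  renderPipeline: IncomingActivity[]
): void {
  if (activity.kind === 'voice-chunk') {
    const { audio } = activity;
    // Fire-and-forget: hand the audio to every registered handler and store nothing.
    handlers.forEach(handler => handler.queueAudio(audio));
    return;
  }
  renderPipeline.push(activity);
}
```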

Performance Optimization

Voice activities use a fire-and-forget pattern to optimize performance:

  • No Storage: Voice chunks (stream.chunk) are NOT stored in Redux - they flow directly to/from audio handlers
  • Function References: Redux stores handler functions (queueAudio, stopAllAudio), not data
  • Separate Saga: postVoiceActivitySaga sends without waiting for echo-back or dispatching PENDING/FULFILLED actions
  • Reduced Overhead: Prevents clogging the main activities array with high-frequency voice events
  • Selective Processing: Only voice transcript activities (which need rendering) go through the standard activity pipeline
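
To make the contrast with the standard posting path concrete, here is a minimal sketch assuming a generic promise-based transport. `Transport`, `postStandard`, `postVoiceChunk`, and the action type strings are hypothetical; the actual sagas use redux-saga and DirectLine, which this sketch does not reproduce. Ignoring a failed chunk is also an assumption made for the example.

```typescript
type Action = { type: string };
type Dispatch = (action: Action) => void;

// Hypothetical transport abstraction standing in for the real connection.
interface Transport {
  send(payload: string): Promise<void>;
}

// Standard-style post: reports progress through actions and waits for completion.
async function postStandard(transport: Transport, payload: string, dispatch: Dispatch): Promise<void> {
  dispatch({ type: 'PENDING' });
  await transport.send(payload);
  dispatch({ type: 'FULFILLED' });
}

// Fire-and-forget post: no actions dispatched, and the caller never waits on the result.
function postVoiceChunk(transport: Transport, payload: string): void {
  transport.send(payload).catch(() => {
    // Assumed policy for this sketch: a failed chunk is dropped rather than retried.
  });
}
```

With many chunks per second, the second form avoids two store updates per chunk, which is the overhead the list above refers to.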

Specific Changes

New Files Added:

Core Utilities (packages/core)

  • isVoiceActivity.ts - Type guard for voice/DTMF activities
  • isVoiceTranscriptActivity.ts - Type guard for transcript activities
  • getVoiceActivityRole.ts - Extract role (user/bot) from voice activity
  • getVoiceActivityText.ts - Extract transcription text from voice activity

Actions (packages/core/src/actions)

  • setVoiceState.ts - Set voice state action
  • startVoiceRecording.ts - Start recording action (transitions to listening)
  • stopVoiceRecording.ts - Stop recording action (transitions to idle)
  • registerVoiceHandler.ts - Register audio handler with unique ID
  • unregisterVoiceHandler.ts - Unregister audio handler by ID
  • postVoiceActivity.ts - Fire-and-forget voice activity posting

Reducer (packages/core/src/reducers)

  • voiceActivity.ts - Manages voiceState and voiceHandlers (Map<string, VoiceHandler>)
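
A reducer of this shape might be sketched as follows. This is not the contents of voiceActivity.ts: the action type strings and payload layouts are hypothetical, and only the state fields (voiceState, voiceHandlers as a Map keyed by ID) and the handler function names come from the description.

```typescript
type VoiceState = 'idle' | 'listening' | 'user_speaking' | 'processing' | 'bot_speaking';

interface VoiceHandler {
  queueAudio(audio: string): void;
  stopAllAudio(): void;
}

interface VoiceActivityState {
  voiceState: VoiceState;
  voiceHandlers: ReadonlyMap<string, VoiceHandler>;
}

// Hypothetical action shapes; the real action types may differ.
type VoiceAction =
  | { type: 'REGISTER_VOICE_HANDLER'; payload: { id: string; handler: VoiceHandler } }
  | { type: 'UNREGISTER_VOICE_HANDLER'; payload: { id: string } }
  | { type: 'SET_VOICE_STATE'; payload: { voiceState: VoiceState } };

const initialState: VoiceActivityState = { voiceState: 'idle', voiceHandlers: new Map() };

function voiceActivity(state: VoiceActivityState = initialState, action: VoiceAction): VoiceActivityState {
  switch (action.type) {
    case 'REGISTER_VOICE_HANDLER': {
      // Copy the map so the previous state object is never mutated.
      const voiceHandlers = new Map(state.voiceHandlers);
      voiceHandlers.set(action.payload.id, action.payload.handler);
      return { ...state, voiceHandlers };
    }
    case 'UNREGISTER_VOICE_HANDLER': {
      const voiceHandlers = new Map(state.voiceHandlers);
      voiceHandlers.delete(action.payload.id);
      return { ...state, voiceHandlers };
    }
    case 'SET_VOICE_STATE':
      return { ...state, voiceState: action.payload.voiceState };
    default:
      return state;
  }
}
```

Keying handlers by ID is what lets several handlers coexist and be removed independently.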

Sagas (packages/core/src/sagas)

  • postVoiceActivitySaga.ts - Handles outgoing voice activities (fire-and-forget)
  • observeActivitySaga.ts (updated, not new) - Routes incoming voice activities to handlers

Provider & Hooks (packages/api)

  • SpeechToSpeechComposer.tsx - Main S2S provider (integrated into Composer)
  • VoiceHandlerBridge.tsx - Registers audio player with Redux
  • VoiceRecorderBridge.tsx - Bridges recording state with microphone
  • useRecorder.ts - AudioWorklet-based recording (CSP compliant)
  • useAudioPlayer.ts - Audio playback with buffer queueing
  • useVoiceHandlers.ts - Hook to get registered voice handlers
  • useRegisterVoiceHandler.ts - Hook to register a voice handler (returns unregister function)
  • useSetVoiceState.ts - Hook to set voice state
  • usePostVoiceActivity.ts - Hook to post voice activities

Test Coverage Added:

  • Unit tests for useRecorder and useAudioPlayer hooks
  • E2E HTML tests covering:
    • Happy path conversation flow
    • Barge-in/interruption handling
    • CSP compliance for AudioWorklet
    • Audio chunk timing and intervals

  • I have added tests and executed them locally
  • I have updated CHANGELOG.md
  • I have updated documentation

Review Checklist

This section is for contributors to review your work.

  • Accessibility reviewed (tab order, content readability, alt text, color contrast)
  • Browser and platform compatibilities reviewed
  • CSS styles reviewed (minimal rules, no z-index)
  • Documents reviewed (docs, samples, live demo)
  • Internationalization reviewed (strings, unit formatting)
  • package.json and package-lock.json reviewed
  • Security reviewed (no data URIs, check for nonce leak)
  • Tests reviewed (coverage, legitimacy)

pranavjoshi001 changed the title from "Feature/core s2s composer" to "Core speech to speech composer implementation (no-op code)" on Dec 12, 2025
pranavjoshi001 changed the title from "Core speech to speech composer implementation (no-op code)" to "Core speech to speech implementation" on Jan 14, 2026