feat(audio): live speech-to-text streaming transcriber #77

Open
VineeTagarwaL-code wants to merge 24 commits into main from feat/live-stt

@VineeTagarwaL-code
Collaborator

Summary

Adds jigsaw.audio.speech_to_text_live(config?), a live streaming transcriber that exposes a WritableStream for mono PCM16 audio and emits open / delta / turn / warning / error / close events. Internally it handles WAV framing, chunk overlap, token-level stitching, and SSE parsing against /v1/ai/transcribe?stream=true.
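
For reviewers less familiar with SSE framing, here is a minimal sketch of the kind of parsing the transport has to do. It is not the PR's implementation: `parseSse` and `SseEvent` are hypothetical names, and the JSON payload shape used in the usage note is an assumption. Only the event names (transcript.delta, transcript.done, transcript.final) come from this PR.

```typescript
// Hypothetical sketch, not the code in this PR.
interface SseEvent {
  event: string; // e.g. "transcript.delta"
  data: string;  // raw payload; the caller decides whether to JSON.parse it
}

// Parses every complete event in `buffer` and returns the unparsed tail,
// which the caller prepends to the next network read.
function parseSse(buffer: string): { events: SseEvent[]; rest: string } {
  const events: SseEvent[] = [];
  // SSE allows \r\n, \n or \r line endings. A CRLF split across two reads
  // is not handled in this sketch.
  const normalized = buffer.replace(/\r\n?/g, "\n");
  const blocks = normalized.split("\n\n");
  // The last piece has no terminating blank line yet, so it is incomplete.
  const rest = blocks.pop() ?? "";
  for (const block of blocks) {
    let event = "message"; // default event type per the SSE format
    const dataLines: string[] = [];
    for (const line of block.split("\n")) {
      // Skip empty lines and comment lines (those starting with ":").
      if (line === "" || line.charAt(0) === ":") continue;
      const colon = line.indexOf(":");
      const field = colon === -1 ? line : line.slice(0, colon);
      let value = colon === -1 ? "" : line.slice(colon + 1);
      if (value.charAt(0) === " ") value = value.slice(1); // one optional space
      if (field === "event") event = value;
      else if (field === "data") dataLines.push(value);
    }
    if (dataLines.length > 0) {
      events.push({ event, data: dataLines.join("\n") });
    }
  }
  return { events, rest };
}
```

A caller would typically keep `rest` between reads, for example `({ events, rest } = parseSse(rest + decodedChunk))`, and dispatch on `event` to decide whether to emit a delta or a turn.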

Public API

import { Readable } from "stream";
import recorder from "node-record-lpcm16";
import { JigsawStack } from "jigsawstack";

const jigsaw = JigsawStack({ apiKey: process.env.JIGSAWSTACK_API_KEY });
const transcriber = jigsaw.audio.speech_to_text_live({ sampleRate: 16000 });

transcriber.on("delta", ({ text }) => process.stdout.write(`\r… ${text}`));
transcriber.on("turn",  ({ text }) => console.log(`\n${text}`));

await transcriber.connect();

const rec = recorder.record({ sampleRate: 16000, channels: 1, audioType: "raw" });
Readable.toWeb(rec.stream()).pipeTo(transcriber.stream());

process.on("SIGINT", async () => { rec.stop(); await transcriber.close(); process.exit(); });

Full working example at examples/live-mic.js.

Design

  • Chunker buffers PCM16 bytes with configurable overlap (default 5s chunks, 2s overlap), produces WAV chunks, supports buffer-overflow frame dropping (warning emitted).
  • Stitcher detects token-level overlap between adjacent chunks with fuzzy matching (single-char substitution / insertion-deletion on tokens ≥ 4 chars) and strips duplicates, preserving punctuation that's genuinely new.
  • SSE transport posts WAV + parses transcript.delta / transcript.done / transcript.final events via a new RequestClient.fetchJSSStream method.
  • Transcriber composes the above: state machine (idle → open → closing → closed, with errored → closed branch), serial chunk processing (one in-flight at a time, required for ordered stitching), per-chunk 30s AbortController timeout, 3 consecutive chunk failures → fatal.
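
To make the stitching idea concrete, here is a rough sketch of token-level overlap removal with the fuzzy rule described above (at most one substitution, insertion, or deletion, applied only to tokens of 4 or more characters). It is an illustration under assumptions, not the Stitcher in this PR: `stripOverlap`, `tokensMatch`, `withinOneEdit`, and the `maxTokens` cap are hypothetical, and the sketch does not attempt the punctuation-preserving behaviour.

```typescript
// Hypothetical sketch, not the Stitcher shipped in this PR.

// Lowercase a token and strip leading/trailing punctuation for comparison.
function norm(token: string): string {
  return token.toLowerCase().replace(/^[^a-z0-9]+|[^a-z0-9]+$/g, "");
}

// True when a and b differ by at most one substitution, insertion or deletion.
function withinOneEdit(a: string, b: string): boolean {
  if (a === b) return true;
  if (Math.abs(a.length - b.length) > 1) return false;
  let i = 0;
  let j = 0;
  let edits = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) { i++; j++; continue; }
    if (++edits > 1) return false;
    if (a.length > b.length) i++;       // extra character in a
    else if (a.length < b.length) j++;  // extra character in b
    else { i++; j++; }                  // substitution
  }
  // A single leftover trailing character counts as one edit.
  return edits + (a.length - i) + (b.length - j) <= 1;
}

// Exact match after normalisation, or a one-edit match on tokens >= 4 chars.
function tokensMatch(a: string, b: string): boolean {
  const x = norm(a);
  const y = norm(b);
  if (x === y) return true;
  return x.length >= 4 && y.length >= 4 && withinOneEdit(x, y);
}

// Finds the longest run where the tail of `prev` matches the head of `next`
// and returns `next` with that duplicated run removed.
function stripOverlap(prev: string, next: string, maxTokens = 30): string {
  const p = prev.split(/\s+/).filter((t) => t.length > 0);
  const n = next.split(/\s+/).filter((t) => t.length > 0);
  const limit = Math.min(p.length, n.length, maxTokens);
  for (let k = limit; k >= 1; k--) {
    let ok = true;
    for (let i = 0; i < k && ok; i++) {
      ok = tokensMatch(p[p.length - k + i], n[i]);
    }
    if (ok) return n.slice(k).join(" ");
  }
  return n.join(" "); // no overlap detected
}
```

For example, stitching "we tested the streaming transcriber" with "the streeming transcriber, and it works" leaves "and it works". The sketch also shows why ordering matters: each call needs the previous chunk's transcript, so chunks have to be processed one at a time.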

Key decisions

  • Streaming is English-only per JigsawStack docs — language is not exposed on LiveSTTConfig, always sent as en internally.
  • Mono audio required: channels is not exposed; users with stereo sources must downmix before piping (see downmixToMono helper in tests/audio-live.test.ts). Exposing channels was a correctness hazard if mismatched with actual audio.
  • No mic library in SDK deps — users bring their own audio source (browser MediaStream, Node node-record-lpcm16, file, etc.). Keeps the SDK bundle lean and platform-agnostic.
  • Serial chunk requests, not parallel — overlap stitching needs prevTranscript state ordering; parallel requests would produce out-of-order stitches.
  • close event fires exactly once — guaranteed via emitClose() helper, including after fatal errors.
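
For users with stereo sources, a downmix step could look roughly like the sketch below. It assumes interleaved 16-bit stereo samples (L, R, L, R, ...) and averages each pair. `downmixStereoToMono` is a hypothetical name; the downmixToMono helper in tests/audio-live.test.ts may be implemented differently.

```typescript
// Hypothetical sketch; the helper in tests/audio-live.test.ts may differ.
// Input: interleaved stereo PCM16 samples (L, R, L, R, ...).
// Output: mono PCM16 samples, one per stereo frame.
function downmixStereoToMono(interleaved: Int16Array): Int16Array {
  const frames = Math.floor(interleaved.length / 2); // ignore a dangling sample
  const mono = new Int16Array(frames);
  for (let i = 0; i < frames; i++) {
    // The average of two int16 values always fits back into int16.
    mono[i] = Math.round((interleaved[2 * i] + interleaved[2 * i + 1]) / 2);
  }
  return mono;
}
```

The result can then be written to the transcriber's stream as raw bytes, for example via `new Uint8Array(mono.buffer)`.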

Config

interface LiveSTTConfig {
  translate?: boolean;       // default false
  vad?: boolean;             // default true
  vadThreshold?: number;     // default 0.4
  sampleRate?: number;       // default 16000
  chunkSeconds?: number;     // default 5
  overlapSeconds?: number;   // default 2
  maxBufferSeconds?: number; // default 30
}
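
To show what the chunking defaults mean in bytes, here is a small sketch that resolves sampleRate, chunkSeconds, and overlapSeconds into buffer sizes and builds the standard 44-byte header for a mono PCM16 WAV chunk. `chunkGeometry` and `wavHeader` are hypothetical names, and the assumption that each chunk re-sends the last overlapSeconds of the previous one is my reading of the design, not confirmed against the Chunker source.

```typescript
// Hypothetical sketch; names and structure are not taken from the PR's Chunker.
interface ChunkingConfig {
  sampleRate?: number;     // default 16000
  chunkSeconds?: number;   // default 5
  overlapSeconds?: number; // default 2
}

const BYTES_PER_SAMPLE = 2; // PCM16, mono

function chunkGeometry(config: ChunkingConfig = {}) {
  const sampleRate = config.sampleRate ?? 16000;
  const chunkSeconds = config.chunkSeconds ?? 5;
  const overlapSeconds = config.overlapSeconds ?? 2;
  // Round to whole samples so byte counts stay aligned to 2 bytes.
  const chunkBytes = Math.round(sampleRate * chunkSeconds) * BYTES_PER_SAMPLE;
  const overlapBytes = Math.round(sampleRate * overlapSeconds) * BYTES_PER_SAMPLE;
  // Assumption: each chunk starts `stride` bytes after the previous one.
  const strideBytes = chunkBytes - overlapBytes;
  if (strideBytes <= 0) {
    throw new Error("overlapSeconds must be smaller than chunkSeconds");
  }
  return { chunkBytes, overlapBytes, strideBytes };
}

// Builds a canonical 44-byte RIFF/WAVE header for mono 16-bit PCM.
function wavHeader(dataBytes: number, sampleRate: number): Uint8Array {
  const header = new Uint8Array(44);
  const view = new DataView(header.buffer);
  const ascii = (offset: number, s: string) => {
    for (let i = 0; i < s.length; i++) header[offset + i] = s.charCodeAt(i);
  };
  ascii(0, "RIFF");
  view.setUint32(4, 36 + dataBytes, true); // file size minus 8 bytes
  ascii(8, "WAVE");
  ascii(12, "fmt ");
  view.setUint32(16, 16, true);            // fmt chunk size
  view.setUint16(20, 1, true);             // audio format: PCM
  view.setUint16(22, 1, true);             // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * BYTES_PER_SAMPLE, true); // byte rate
  view.setUint16(32, BYTES_PER_SAMPLE, true);              // block align
  view.setUint16(34, 16, true);            // bits per sample
  ascii(36, "data");
  view.setUint32(40, dataBytes, true);
  return header;
}
```

With the defaults this gives 160000 bytes per chunk, 64000 bytes of overlap, and a 96000-byte stride, so under this reading roughly 3 seconds of new audio are sent per request.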

Test plan

  • 36 unit tests across tests/live/ (chunker, stitcher, sse, transcriber, request-stream). All green.
  • yarn test:all glob extended to include tests/live/*.ts so they run in CI.
  • Build clean (yarn build).
  • Biome formatter applied.
  • Opt-in live integration test at tests/audio-live.test.ts (gated on JIGSAWSTACK_API_KEY, runnable via yarn test:audio:live) — passed against staging.
  • Manual verification via real mic against the live API (see demo walkthrough in the feature branch's dev history).

Docs

  • Design spec: docs/superpowers/specs/2026-04-21-live-stt-design.md
  • Implementation plan: docs/superpowers/plans/2026-04-21-live-stt.md
  • README section added
  • examples/live-mic.js reference

Backward compatibility

Purely additive. Existing speech_to_text, SpeechToTextParams, SpeechToTextResponse untouched.

🤖 Generated with Claude Code

VineeTagarwaL-code and others added 24 commits April 21, 2026 06:04
Design for a streaming transcriber (jigsaw.audio.speech_to_text_live)
that accepts a WritableStream of PCM16 audio, internally chunks with
overlap, and stitches SSE transcripts into delta/turn events.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Removed from config, types, stitcher responsibility, file-tree comment,
and stitcher test list.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TDD plan in 10 tasks: fetchJSSStream extension, Chunker, Stitcher,
SSE transport, LiveSTT types, Transcriber class, API wiring, cleanup
(drop node-record-lpcm16 from deps), opt-in integration test, README.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@VineeTagarwaL-code
Collaborator Author

When it's ready to merge we will delete the docs; until then they will come in handy while reviewing or iterating on this further.
