
[Example] 530 — Silero VAD Speech Segmentation with Deepgram STT (Python)#208

Merged
lukeocodes merged 1 commit into main from example/530-silero-vad-speech-segmentation-python
Apr 11, 2026

Conversation

@github-actions
Contributor

@github-actions github-actions bot commented Apr 8, 2026

New example: Silero VAD Speech Segmentation with Deepgram STT

Integration: Silero VAD | Language: Python | Products: STT

What this shows

Demonstrates how to use Silero VAD (Voice Activity Detection) to detect speech regions in an audio file, extract each segment, and transcribe them individually with Deepgram. This covers a common pre-processing pipeline: detect speech boundaries locally with Silero VAD, slice the waveform, and send each speech chunk to Deepgram's nova-3 model for transcription.
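The pipeline described above can be sketched roughly as follows. This is an illustrative outline, not the example's actual `src/segmenter.py`: it assumes the `silero-vad` package, and the Deepgram step is shown only in comment form because the reviews cite `client.listen.v1.media.transcribe_file()` without its full signature.

```python
"""Sketch of the VAD -> STT pipeline: detect speech locally, slice, transcribe.

Illustrative only; assumes the silero-vad package and deepgram-sdk v6.x.
"""


def regions_to_seconds(regions, sample_rate=16000):
    """Convert Silero's sample-offset timestamps to (start_s, end_s) pairs."""
    return [(r["start"] / sample_rate, r["end"] / sample_rate) for r in regions]


def transcribe_speech_regions(audio_path, threshold=0.5):
    # Third-party imports are deferred so the pure helper above can be
    # used (and tested) without these packages installed.
    from silero_vad import load_silero_vad, get_speech_timestamps, read_audio

    sample_rate = 16000
    wav = read_audio(audio_path, sampling_rate=sample_rate)  # mono waveform
    model = load_silero_vad()
    regions = get_speech_timestamps(
        wav, model, threshold=threshold, sampling_rate=sample_rate
    )

    results = []
    for region, (start_s, end_s) in zip(regions, regions_to_seconds(regions)):
        chunk = wav[region["start"]:region["end"]]  # slice the waveform
        # Wrap `chunk` as WAV bytes, then send it to Deepgram's nova-3 model,
        # e.g. via the v6 SDK surface the reviews cite:
        #   client.listen.v1.media.transcribe_file(..., model="nova-3",
        #                                          tag="deepgram-examples")
        results.append({"start": start_s, "end": end_s})
    return results
```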

Required secrets

DEEPGRAM_API_KEY only (Silero VAD runs locally, so no additional secrets are needed)

Tests

✅ Tests passed

── detect_speech_regions ──
  Found 4 speech region(s)
  ✓ detect_speech_regions finds speech in real audio

── extract_segment_bytes ──
  ✓ extract_segment_bytes produces valid WAV

── process_audio (end-to-end) ──
  Transcribed 4 segment(s), 331 total chars
  ✓ End-to-end pipeline transcribes real speech correctly

── VAD parameters ──
  threshold=0.5 → 4 regions, threshold=0.9 → 8 regions
  ✓ VAD parameters affect segmentation output

Results: 4 passed, 0 failed
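The `threshold=0.5 → 4 regions, threshold=0.9 → 8 regions` result is expected: raising the threshold makes borderline frames count as silence, which can split one long region at its low-confidence dips. A toy illustration on a made-up probability curve (not Silero's actual algorithm, which also applies minimum-duration and padding rules):

```python
def regions_above(probs, threshold):
    """Group consecutive frames whose speech probability meets the threshold."""
    regions, start = [], None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i
        elif p < threshold and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(probs)))
    return regions


# One long region with a mid-region confidence dip to 0.7:
probs = [0.95, 0.95, 0.7, 0.95, 0.95]
print(regions_above(probs, 0.5))  # one region: [(0, 5)]
print(regions_above(probs, 0.9))  # the dip splits it: [(0, 2), (3, 5)]
```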

Built by Engineer on 2026-04-08

@github-actions
Contributor Author

github-actions bot commented Apr 8, 2026

Code Review

Overall: APPROVED

Tests ran ✅

── detect_speech_regions ──
  Found 4 speech region(s)
    0.07s - 2.69s
    2.79s - 4.32s
    4.51s - 12.54s
    12.71s - 25.41s
  ✓ detect_speech_regions finds speech in real audio

── extract_segment_bytes ──
  ✓ extract_segment_bytes produces valid WAV

── process_audio (end-to-end) ──
Detected 4 speech region(s). Transcribing...
  Transcribed 4 segment(s), 331 total chars
    [0.1s-2.7s] conf=1.00 'Yeah. As as much as, it's worth'
    [2.8s-4.3s] conf=0.53 'Celebrating'
    [4.5s-12.5s] conf=1.00 'The first, spacewalk, with an all female team...'
    [12.7s-25.4s] conf=1.00 'And, I think if it signifies anything, it is, to honor the t...'
  ✓ End-to-end pipeline transcribes real speech correctly

── VAD parameters ──
  threshold=0.5 → 4 regions, threshold=0.9 → 8 regions
  ✓ VAD parameters affect segmentation output

Results: 4 passed, 0 failed

Integration genuineness

Pass — Silero VAD is a local audio processing library (no cloud API). The example correctly:

  1. Imports and uses silero_vad (load_silero_vad, get_speech_timestamps, read_audio)
  2. Makes real VAD calls on actual audio — not mocked or hardcoded
  3. .env.example appropriately lists only DEEPGRAM_API_KEY (Silero VAD has no credentials)
  4. Tests exit code 2 on missing credentials, real Deepgram API calls in e2e test
  5. Audio flows through Silero VAD first (segmentation), then each segment goes to Deepgram — correct pipeline pattern
  6. No raw WebSocket/fetch calls — uses official deepgram-sdk
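What the "extract_segment_bytes produces valid WAV" check amounts to can be sketched with the stdlib `wave` module. This is illustrative only; the example's actual helper may differ:

```python
import io
import wave


def extract_segment_bytes(pcm, sample_rate, start_s, end_s, sample_width=2):
    """Wrap a slice of mono 16-bit PCM as a standalone in-memory WAV file."""
    start = int(start_s * sample_rate) * sample_width
    end = int(end_s * sample_rate) * sample_width
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(sample_width)
        wf.setframerate(sample_rate)
        wf.writeframes(pcm[start:end])
    return buf.getvalue()


# A 1-second silent clip at 16 kHz; slice out the 0.25s-0.75s region:
pcm = b"\x00\x00" * 16000
segment = extract_segment_bytes(pcm, 16000, 0.25, 0.75)
assert segment[:4] == b"RIFF" and segment[8:12] == b"WAVE"  # valid WAV header
```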

Code quality

  • ✅ Official Deepgram SDK (deepgram-sdk==6.1.1) — correct version
  • ✅ tag="deepgram-examples" present on all Deepgram API calls
  • ✅ No hardcoded credentials
  • ✅ Error handling: credential check in main(), file existence check, proper exit codes
  • ✅ Tests import from src/ and call the example's actual functions (detect_speech_regions, extract_segment_bytes, process_audio)
  • ✅ Transcript assertions use length/duration proportionality (min_chars = max(5, audio_duration_sec * 2)) — no word lists
  • ✅ Credential check runs before SDK imports in tests (lines 13-22 before from segmenter import ...)
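The length/duration proportionality check praised above avoids flaky exact-word assertions. A minimal sketch of the idea, using the `min_chars = max(5, audio_duration_sec * 2)` formula quoted in the review (the surrounding code is illustrative):

```python
def min_expected_chars(audio_duration_sec):
    """Expect at least ~2 characters of transcript per second of audio."""
    return max(5, int(audio_duration_sec * 2))


# The end-to-end test transcribed 331 chars from ~25.4s of audio:
transcript_chars = 331
assert transcript_chars >= min_expected_chars(25.4)  # 331 >= 50
```

The point of the `max(5, ...)` floor is that very short clips still require a non-trivial transcript, while no assertion ever depends on specific words appearing.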

Documentation

  • ✅ README: "What you'll build" section, env vars with links, install/run instructions, CLI options, key parameters, how it works
  • ✅ .env.example present and complete

✓ All checks pass. Ready for merge.


Review by Lead on 2026-04-08

@github-actions github-actions bot added the status:review-passed Self-review passed label Apr 8, 2026
@github-actions
Contributor Author

github-actions bot commented Apr 9, 2026

Code Review

Overall: APPROVED

Tests ran ✅

tests/test_example.py::test_detect_speech_regions PASSED
tests/test_example.py::test_extract_segment_bytes PASSED
tests/test_example.py::test_process_audio_end_to_end PASSED
tests/test_example.py::test_vad_parameters_affect_output PASSED

4 passed, 12 warnings in 5.91s

── detect_speech_regions ──
  Found 4 speech region(s)
    0.07s - 2.69s
    2.79s - 4.32s
    4.51s - 12.54s
    12.71s - 25.41s
  ✓ detect_speech_regions finds speech in real audio

── extract_segment_bytes ──
  ✓ extract_segment_bytes produces valid WAV

── process_audio (end-to-end) ──
  Transcribed 4 segment(s), 331 total chars
    [0.1s-2.7s] conf=1.00
    [2.8s-4.3s] conf=0.53
    [4.5s-12.5s] conf=1.00
    [12.7s-25.4s] conf=1.00
  ✓ End-to-end pipeline transcribes real speech correctly

── VAD parameters ──
  threshold=0.5 → 4 regions, threshold=0.9 → 8 regions
  ✓ VAD parameters affect segmentation output

Results: 4 passed, 0 failed

Integration genuineness

Pass.

  1. ✅ Silero VAD SDK imported and used (from silero_vad import get_speech_timestamps, load_silero_vad, read_audio)
  2. ✅ Real VAD calls made — get_speech_timestamps() runs locally on audio waveform
  3. ✅ .env.example lists DEEPGRAM_API_KEY (Silero VAD runs locally — no API key needed for it)
  4. ✅ Tests exit with code 2 on missing credentials
  5. ✅ BYPASS CHECK: Silero VAD is a local pre-processing library (not a Deepgram audio interface), so DeepgramClient direct usage is correct — audio flows through Silero VAD segmentation first, then each segment is transcribed via Deepgram
  6. ✅ NO RAW PROTOCOL CHECK: Uses client.listen.v1.media.transcribe_file() — no raw WebSocket/fetch

Code quality

  • ✅ Official Deepgram SDK (deepgram-sdk==6.1.1 — matches required v6.1.1)
  • ✅ tag="deepgram-examples" present on Deepgram API call
  • ✅ No hardcoded credentials
  • ✅ Error handling for missing credentials and file not found
  • ✅ Tests import from src/ and call the example's actual functions
  • ✅ Transcript assertions use length/duration proportionality (min_chars = max(5, audio_duration_sec * 2)) — no word lists
  • ✅ Credential check runs first (top of test file, before SDK imports)
  • ⚠️ Minor: import torchaudio in segmenter.py line 24 is unused (silero-vad uses it internally, but the import in this file is dead code). Please remove it.
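The "credential check runs first, exit code 2" pattern noted above means a run without DEEPGRAM_API_KEY fails fast before any SDK import can throw. A sketch of the guard, written as a helper function here for clarity (the actual test file places the equivalent check at module top, before any `from segmenter import ...` line):

```python
import os
import sys


def require_deepgram_key(env=None):
    """Exit with status 2 (distinct from a test failure) when the key is missing."""
    env = os.environ if env is None else env
    if not env.get("DEEPGRAM_API_KEY"):
        print("DEEPGRAM_API_KEY is not set; cannot run integration tests.",
              file=sys.stderr)
        sys.exit(2)


# In the real test file this check runs at import time, so nothing that
# constructs a DeepgramClient is ever reached without credentials.
```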

Documentation

  • ✅ README has "What you'll build" section
  • ✅ Environment variables table with console link
  • ✅ Install and run instructions
  • ✅ CLI options and key parameters documented
  • ✅ .env.example present and complete

✓ All checks pass. Ready for merge.

One minor nit: remove the unused import torchaudio from src/segmenter.py.


Review by Lead on 2026-04-09

@github-actions
Contributor Author

github-actions bot commented Apr 9, 2026

Code Review

Overall: APPROVED

Tests ran ✅

tests/test_example.py::test_detect_speech_regions PASSED
tests/test_example.py::test_extract_segment_bytes PASSED
tests/test_example.py::test_process_audio_end_to_end PASSED
tests/test_example.py::test_vad_parameters_affect_output PASSED

── detect_speech_regions ──
  Found 4 speech region(s)
    0.07s - 2.69s
    2.79s - 4.32s
    4.51s - 12.54s
    12.71s - 25.41s
  ✓ detect_speech_regions finds speech in real audio

── extract_segment_bytes ──
  ✓ extract_segment_bytes produces valid WAV

── process_audio (end-to-end) ──
  Transcribed 4 segment(s), 331 total chars
    [0.1s-2.7s] conf=1.00
    [2.8s-4.3s] conf=0.53
    [4.5s-12.5s] conf=1.00
    [12.7s-25.4s] conf=1.00
  ✓ End-to-end pipeline transcribes real speech correctly

── VAD parameters ──
  threshold=0.5 → 4 regions, threshold=0.9 → 8 regions
  ✓ VAD parameters affect segmentation output

Results: 4 passed, 0 failed

Integration genuineness

Pass — All 6 checks passed:

  1. Silero VAD SDK imported and used (get_speech_timestamps, load_silero_vad, read_audio)
  2. Real VAD call made on actual audio — not mocked or hardcoded
  3. .env.example lists DEEPGRAM_API_KEY (Silero VAD is local, no API key needed — legitimate)
  4. Tests exit with code 2 if credentials missing
  5. Bypass check passed — Silero VAD is a local speech detection model (not a Deepgram wrapper), so DeepgramClient usage for transcription is correct
  6. No raw protocol — all Deepgram API contact uses the official SDK

Code quality

  • ✅ Official Deepgram SDK (deepgram-sdk==6.1.1) — correct pinned version
  • ✅ tag="deepgram-examples" present on Deepgram API call
  • ✅ No hardcoded credentials
  • ✅ Error handling covers missing API key, missing file, empty speech detection
  • ✅ Tests import from src/ and call the example's actual functions
  • ✅ Transcript assertions use length/duration proportionality — no specific word lists
  • ✅ Credential check runs first (before SDK imports that could throw)

Documentation

  • ✅ README includes "What you'll build", env vars table with console link, install/run instructions, CLI options, parameter table, and architecture explanation
  • ✅ .env.example present and complete

✓ All checks pass. Ready for merge.


Review by Lead on 2026-04-09

@github-actions
Contributor Author

github-actions bot commented Apr 9, 2026

Code Review

Overall: APPROVED

Tests ran ✅

tests/test_example.py::test_detect_speech_regions PASSED                 [ 25%]
tests/test_example.py::test_extract_segment_bytes PASSED                 [ 50%]
tests/test_example.py::test_process_audio_end_to_end PASSED              [ 75%]
tests/test_example.py::test_vad_parameters_affect_output PASSED          [100%]

4 passed, 12 warnings in 6.87s

── detect_speech_regions ──
  Found 4 speech region(s)
    0.07s - 2.69s / 2.79s - 4.32s / 4.51s - 12.54s / 12.71s - 25.41s
  ✓ detect_speech_regions finds speech in real audio

── extract_segment_bytes ──
  ✓ extract_segment_bytes produces valid WAV

── process_audio (end-to-end) ──
  Transcribed 4 segment(s), 331 total chars
  ✓ End-to-end pipeline transcribes real speech correctly

── VAD parameters ──
  threshold=0.5 → 4 regions, threshold=0.9 → 8 regions
  ✓ VAD parameters affect segmentation output

Integration genuineness

Pass — All 6 checks passed:

  1. Silero VAD SDK imported (from silero_vad import get_speech_timestamps, load_silero_vad, read_audio)
  2. Real VAD calls: get_speech_timestamps() runs on actual audio waveform
  3. .env.example lists DEEPGRAM_API_KEY — correct since Silero VAD is a local library (no API key needed)
  4. Tests exit 2 on missing credentials (test_example.py:20-22)
  5. Bypass check: Pass — Silero VAD is a local preprocessing library (not an audio/speech interface wrapping Deepgram), so DeepgramClient usage for transcription is the correct pattern
  6. No raw protocol: Pass — all Deepgram calls use the official SDK via client.listen.v1.media.transcribe_file()

Code quality

  • ✅ Official Deepgram SDK: deepgram-sdk==6.1.1 (matches required version)
  • ✅ tag="deepgram-examples" present on Deepgram API call (segmenter.py:106)
  • ✅ No hardcoded credentials
  • ✅ Error handling: credential check + file existence check in main()
  • ✅ Tests import from src/ and call the example's actual functions (detect_speech_regions, extract_segment_bytes, process_audio)
  • ✅ Transcript assertions use length/duration proportionality (min_chars based on audio_duration_sec * 2) — no specific word lists
  • ✅ Credential check runs first in test (module-level, before any SDK imports that could throw)
  • ⚠️ Minor: import torchaudio (segmenter.py:24) is unused — silero_vad.read_audio handles audio loading. Harmless since torchaudio is a silero_vad dependency anyway, but could be removed for cleanliness.

Documentation

  • ✅ README has "What you'll build" section
  • ✅ Environment variables table with link to Deepgram console
  • ✅ Install and run instructions with CLI options
  • ✅ Key parameters table
  • ✅ "How it works" explanation
  • ✅ .env.example present and complete

✓ All checks pass. Ready for merge.

Note: Attempted to push a minor cleanup (remove unused torchaudio import) but lacked push permissions. This is non-blocking.


Review by Lead on 2026-04-09

@github-actions
Contributor Author

github-actions bot commented Apr 9, 2026

Code Review

Overall: APPROVED

Tests ran ✅

tests/test_example.py::test_detect_speech_regions PASSED
tests/test_example.py::test_extract_segment_bytes PASSED
tests/test_example.py::test_process_audio_end_to_end PASSED
tests/test_example.py::test_vad_parameters_affect_output PASSED

4 passed, 0 failed

── detect_speech_regions ──
  Found 4 speech region(s)
    0.07s - 2.69s
    2.79s - 4.32s
    4.51s - 12.54s
    12.71s - 25.41s
  ✓ detect_speech_regions finds speech in real audio

── extract_segment_bytes ──
  ✓ extract_segment_bytes produces valid WAV

── process_audio (end-to-end) ──
  Transcribed 4 segment(s), 331 total chars
    [0.1s-2.7s] conf=1.00 'Yeah. As as much as, it's worth'
    [2.8s-4.3s] conf=0.53 'Celebrating'
    [4.5s-12.5s] conf=1.00 'The first, spacewalk, with an all female team...'
    [12.7s-25.4s] conf=1.00 'And, I think if it signifies anything, it is...'
  ✓ End-to-end pipeline transcribes real speech correctly

── VAD parameters ──
  threshold=0.5 → 4 regions, threshold=0.9 → 8 regions
  ✓ VAD parameters affect segmentation output

Integration genuineness

Pass — All 6 checks satisfied:

  1. Silero VAD SDK imported and used (get_speech_timestamps, load_silero_vad, read_audio)
  2. Real VAD inference runs on actual audio — not mocked or hardcoded
  3. .env.example lists DEEPGRAM_API_KEY (Silero VAD is a local library — no API key needed)
  4. Tests exit(2) on missing credentials before any SDK imports
  5. Bypass check OK: Silero VAD is a local processing library with no Deepgram audio interface — DeepgramClient usage is appropriate for sending VAD-segmented chunks to Deepgram
  6. No raw protocol: All Deepgram contact goes through deepgram-sdk

Code quality

  • ✅ deepgram-sdk==6.1.1 — matches required Python SDK version
  • ✅ tag="deepgram-examples" present on every Deepgram API call
  • ✅ No hardcoded credentials; credential check in main() and in tests
  • ✅ Error handling covers missing key and missing file
  • ✅ Tests import from src/ and call the example's actual functions (detect_speech_regions, extract_segment_bytes, process_audio)
  • ✅ Transcript assertions use length/duration proportionality (audio_duration_sec * 2 min chars), not specific word lists
  • ✅ Credential check runs FIRST in tests (lines 13-22) before any SDK imports
  • ℹ️ Minor: import torchaudio (line 24) and soundfile in requirements.txt are not directly used in segmenter.py — they're transitive dependencies for silero-vad. Not a blocker.

Documentation

  • ✅ README has "What you'll build", env vars table with console links, install/run instructions, CLI options, and "How it works" explanation
  • ✅ .env.example present and complete

✓ All checks pass. Ready for merge.


Review by Lead on 2026-04-09

@github-actions
Contributor Author

github-actions bot commented Apr 9, 2026

Code Review

Overall: APPROVED

Tests ran ✅

tests/test_example.py::test_detect_speech_regions PASSED                 [ 25%]
tests/test_example.py::test_extract_segment_bytes PASSED                 [ 50%]
tests/test_example.py::test_process_audio_end_to_end PASSED              [ 75%]
tests/test_example.py::test_vad_parameters_affect_output PASSED          [100%]

4 passed, 12 warnings in 7.99s

── detect_speech_regions ──
  Found 4 speech region(s)
    0.07s - 2.69s
    2.79s - 4.32s
    4.51s - 12.54s
    12.71s - 25.41s
  ✓ detect_speech_regions finds speech in real audio

── extract_segment_bytes ──
  ✓ extract_segment_bytes produces valid WAV

── process_audio (end-to-end) ──
  Transcribed 4 segment(s), 331 total chars
  ✓ End-to-end pipeline transcribes real speech correctly

── VAD parameters ──
  threshold=0.5 → 4 regions, threshold=0.9 → 8 regions
  ✓ VAD parameters affect segmentation output

Integration genuineness

Pass — All 6 checks satisfied:

  1. Silero VAD SDK imported and used (silero_vad.get_speech_timestamps, load_silero_vad, read_audio)
  2. Real VAD inference runs locally on audio waveform — not mocked
  3. .env.example lists DEEPGRAM_API_KEY; Silero VAD is a local model requiring no API key — correct
  4. Tests exit(2) on missing credentials before any SDK imports
  5. Bypass check pass: Silero VAD is a local pre-processing library with no Deepgram interface — audio correctly flows through Silero for segmentation, then each segment is sent to Deepgram via the SDK
  6. No raw protocol: All Deepgram API contact uses DeepgramClient; no raw WebSocket/fetch calls

Code quality

  • ✅ Official Deepgram SDK (deepgram-sdk==6.1.1) — correct version
  • ✅ tag="deepgram-examples" present on all Deepgram API calls (segmenter.py:106)
  • ✅ No hardcoded credentials
  • ✅ Error handling: credential check and file existence check in main()
  • ✅ Tests import from src/ and call actual example functions (detect_speech_regions, extract_segment_bytes, process_audio)
  • ✅ Transcript assertions use length/duration proportionality (audio_duration_sec * 2 min chars) — no specific word lists
  • ✅ Credential check runs first in tests (lines 13–22) before any src imports

Documentation

  • ✅ README includes "What you'll build", env vars table with console links, install/run instructions, CLI options, and how-it-works section
  • ✅ .env.example present and complete

✓ All checks pass. Ready for merge.


Review by Lead on 2026-04-09

@github-actions
Contributor Author

Code Review

Overall: APPROVED

Tests ran ✅

tests/test_example.py::test_detect_speech_regions PASSED
tests/test_example.py::test_extract_segment_bytes PASSED
tests/test_example.py::test_process_audio_end_to_end PASSED
tests/test_example.py::test_vad_parameters_affect_output PASSED

── detect_speech_regions ──
  Found 4 speech region(s)
    0.07s - 2.69s / 2.79s - 4.32s / 4.51s - 12.54s / 12.71s - 25.41s
  ✓ detect_speech_regions finds speech in real audio

── extract_segment_bytes ──
  ✓ extract_segment_bytes produces valid WAV

── process_audio (end-to-end) ──
  Transcribed 4 segment(s), 331 total chars
  ✓ End-to-end pipeline transcribes real speech correctly

── VAD parameters ──
  threshold=0.5 → 4 regions, threshold=0.9 → 8 regions
  ✓ VAD parameters affect segmentation output

4 passed, 0 failed

Integration genuineness

Pass — All 6 checks satisfied:

  1. Silero VAD SDK imported and used (load_silero_vad, get_speech_timestamps, read_audio)
  2. Real VAD inference runs on actual audio — not mocked or hardcoded
  3. .env.example complete (Silero VAD is local-only, no extra credentials needed)
  4. Tests exit 2 on missing credentials, make real Deepgram API calls
  5. Bypass check: Silero VAD is a pre-processing library with no Deepgram interface — DeepgramClient is correctly used directly for transcription
  6. No raw WebSocket/fetch calls — official deepgram-sdk used throughout

Code quality

  • ✅ deepgram-sdk==6.1.1 — correct pinned version
  • ✅ tag="deepgram-examples" present on all Deepgram API calls
  • ✅ No hardcoded credentials
  • ✅ Error handling: credential check + file existence check in main()
  • ✅ Tests import from src/ and call the example's actual functions
  • ✅ Transcript assertions use length/duration proportionality — no word-list flakiness
  • ✅ Credential check runs first (module-level in test file, before any SDK imports)
  • ⚠️ Minor: import torchaudio in segmenter.py is unused (torchaudio is used internally by silero_vad, but the explicit import is unnecessary). Non-blocking.

Documentation

  • ✅ README includes: what you'll build, env vars with console links, install/run instructions, CLI options, and how-it-works explanation
  • ✅ .env.example present and complete

✓ All checks pass. Ready for merge.

Minor nit: Remove the unused import torchaudio from src/segmenter.py (line 24). Could not push fix due to permissions.


Review by Lead on 2026-04-10

@github-actions
Contributor Author

Code Review

Overall: APPROVED

Tests ran ✅

tests/test_example.py::test_detect_speech_regions PASSED                 [ 25%]
tests/test_example.py::test_extract_segment_bytes PASSED                 [ 50%]
tests/test_example.py::test_process_audio_end_to_end PASSED              [ 75%]
tests/test_example.py::test_vad_parameters_affect_output PASSED          [100%]

======================== 4 passed, 12 warnings in 7.80s ========================

Detailed output:
── detect_speech_regions ──
  Found 4 speech region(s)
    0.07s - 2.69s / 2.79s - 4.32s / 4.51s - 12.54s / 12.71s - 25.41s
  ✓ detect_speech_regions finds speech in real audio

── extract_segment_bytes ──
  ✓ extract_segment_bytes produces valid WAV

── process_audio (end-to-end) ──
  Transcribed 4 segment(s), 331 total chars
  ✓ End-to-end pipeline transcribes real speech correctly

── VAD parameters ──
  threshold=0.5 → 4 regions, threshold=0.9 → 8 regions
  ✓ VAD parameters affect segmentation output

Integration genuineness

Pass — Silero VAD SDK is imported and used for real speech detection (get_speech_timestamps, load_silero_vad, read_audio). Audio flows through Silero VAD for segmentation, then each extracted segment is sent to Deepgram via the official SDK. Silero VAD is a local model (no API key required), so .env.example correctly only lists DEEPGRAM_API_KEY. No bypass — no raw WebSocket/fetch calls.

Code quality

  • ✅ Official Deepgram SDK deepgram-sdk==6.1.1 (current required version)
  • ✅ tag="deepgram-examples" present on Deepgram API calls
  • ✅ No hardcoded credentials
  • ✅ Error handling: credential check + file existence check in main()
  • ✅ Tests import from src/ and call actual example functions
  • ✅ Transcript assertions use length/duration proportionality (not word lists)
  • ✅ Credential check runs first (exit 2 on missing creds) before SDK imports
  • ⚠️ Minor: import torchaudio (line 24) is unused directly — it's a transitive dependency of silero_vad. Non-blocking.

Documentation

  • ✅ README has "What you'll build", env vars table with console links, install/run instructions, CLI options, key parameters, and "How it works" explanation
  • ✅ .env.example present and complete

✓ All checks pass. Ready for merge.


Review by Lead on 2026-04-10

@github-actions
Contributor Author

Code Review

Overall: APPROVED

Tests ran ✅

tests/test_example.py::test_detect_speech_regions PASSED
tests/test_example.py::test_extract_segment_bytes PASSED
tests/test_example.py::test_process_audio_end_to_end PASSED
tests/test_example.py::test_vad_parameters_affect_output PASSED

4 passed, 0 failed

── detect_speech_regions ──
  Found 4 speech region(s)
    0.07s - 2.69s | 2.79s - 4.32s | 4.51s - 12.54s | 12.71s - 25.41s
  ✓ detect_speech_regions finds speech in real audio

── extract_segment_bytes ──
  ✓ extract_segment_bytes produces valid WAV

── process_audio (end-to-end) ──
  Transcribed 4 segment(s), 331 total chars
  ✓ End-to-end pipeline transcribes real speech correctly

── VAD parameters ──
  threshold=0.5 → 4 regions, threshold=0.9 → 8 regions
  ✓ VAD parameters affect segmentation output

Integration genuineness

Pass. Silero VAD SDK is imported and used for real speech detection (load_silero_vad, get_speech_timestamps, read_audio). VAD runs locally on real audio, segments are extracted, then each is sent to Deepgram for transcription. No bypass — audio flows through Silero VAD's processing pipeline before reaching Deepgram. No raw WebSocket/fetch calls. Credential check exits with code 2 when missing.

Code quality

  • ✅ Official Deepgram SDK (deepgram-sdk==6.1.1) — correct version
  • ✅ tag="deepgram-examples" present on all Deepgram API calls
  • ✅ No hardcoded credentials
  • ✅ Error handling: credential check + file existence check in main()
  • ✅ Tests import from src/ and call the example's actual functions (detect_speech_regions, extract_segment_bytes, process_audio)
  • ✅ Transcript assertions use length/duration proportionality (min_chars = max(5, audio_duration_sec * 2)) — no word lists
  • ✅ Credential check runs before SDK imports in tests (module-level check at top of test file with sys.exit(2))
  • ⚠️ Minor: import torchaudio (segmenter.py:24) is unused — it's a transitive dependency of silero_vad but not directly referenced. Safe to remove.

Documentation

  • ✅ README includes "What you'll build", env vars table with console link, install/run instructions, CLI options, and "How it works" explanation
  • ✅ .env.example present and complete

✓ All checks pass. Ready for merge.

Note: Attempted to push a fix for the unused torchaudio import but write access was denied. This is a cosmetic issue only and does not block approval.


Review by Lead on 2026-04-10

@github-actions
Contributor Author

Code Review

Overall: APPROVED

Tests ran ✅

tests/test_example.py::test_detect_speech_regions PASSED
tests/test_example.py::test_extract_segment_bytes PASSED
tests/test_example.py::test_process_audio_end_to_end PASSED
tests/test_example.py::test_vad_parameters_affect_output PASSED

4 passed, 0 failed

── detect_speech_regions ──
  Found 4 speech region(s)
    0.07s - 2.69s
    2.79s - 4.32s
    4.51s - 12.54s
    12.71s - 25.41s
  ✓ detect_speech_regions finds speech in real audio

── extract_segment_bytes ──
  ✓ extract_segment_bytes produces valid WAV

── process_audio (end-to-end) ──
  Transcribed 4 segment(s), 331 total chars
    [0.1s-2.7s] conf=1.00
    [2.8s-4.3s] conf=0.53
    [4.5s-12.5s] conf=1.00
    [12.7s-25.4s] conf=1.00
  ✓ End-to-end pipeline transcribes real speech correctly

── VAD parameters ──
  threshold=0.5 → 4 regions, threshold=0.9 → 8 regions
  ✓ VAD parameters affect segmentation output

Integration genuineness ✅

  • Silero VAD SDK imported and used (get_speech_timestamps, load_silero_vad, read_audio)
  • Real VAD inference runs locally on actual audio — not mocked or hardcoded
  • Deepgram SDK used via client.listen.v1.media.transcribe_file() — no raw HTTP/WebSocket
  • Architecture is correct: Silero VAD is a local pre-processing library (no cloud API), so DeepgramClient is used directly to transcribe VAD-extracted segments
  • No bypass — audio flows through Silero VAD segmentation before reaching Deepgram

Code quality ✅

  • deepgram-sdk==6.1.1 — matches required version
  • tag="deepgram-examples" present on Deepgram API call
  • No hardcoded credentials
  • Error handling: missing API key check and file existence check in main()
  • Tests import from src/ and test the example's actual exported functions
  • Transcript assertions use length/duration proportionality (audio_duration_sec * 2 min chars) — no word-list assertions
  • Credential check exits with code 2 before any SDK calls in tests

Documentation ✅

  • README includes what you'll build, env vars with console links, install/run instructions, CLI options, and architecture overview
  • .env.example present with DEEPGRAM_API_KEY

✓ All checks pass. Ready for merge.


Review by Lead on 2026-04-10

@github-actions
Contributor Author

Code Review

Overall: APPROVED

Tests ran ✅

tests/test_example.py::test_detect_speech_regions PASSED
tests/test_example.py::test_extract_segment_bytes PASSED
tests/test_example.py::test_process_audio_end_to_end PASSED
tests/test_example.py::test_vad_parameters_affect_output PASSED

4 passed, 0 failed (5.69s)

── detect_speech_regions ──
  Found 4 speech region(s)
    0.07s - 2.69s / 2.79s - 4.32s / 4.51s - 12.54s / 12.71s - 25.41s
  ✓ detect_speech_regions finds speech in real audio

── extract_segment_bytes ──
  ✓ extract_segment_bytes produces valid WAV

── process_audio (end-to-end) ──
  Transcribed 4 segment(s), 331 total chars
    [0.1s-2.7s] conf=1.00 'Yeah. As as much as, it's worth'
    [2.8s-4.3s] conf=0.53 'Celebrating'
    [4.5s-12.5s] conf=1.00 'The first, spacewalk, with an all female team...'
    [12.7s-25.4s] conf=1.00 'And, I think if it signifies anything, it is...'
  ✓ End-to-end pipeline transcribes real speech correctly

── VAD parameters ──
  threshold=0.5 → 4 regions, threshold=0.9 → 8 regions
  ✓ VAD parameters affect segmentation output

Integration genuineness

Pass — All 6 checks pass:

  1. Silero VAD SDK imported and used (load_silero_vad, get_speech_timestamps, read_audio)
  2. Real VAD calls made on actual audio — not mocked or hardcoded
  3. .env.example lists DEEPGRAM_API_KEY (Silero VAD is a local library, no API key needed)
  4. Tests exit with code 2 if credentials missing
  5. Bypass check: N/A — Silero VAD is a local pre-processing library, not a partner wrapping Deepgram. Direct DeepgramClient use is correct for the transcription step.
  6. No raw protocol: All Deepgram API contact uses the official SDK

Code quality

  • ✅ deepgram-sdk==6.1.1 — matches required version
  • ✅ tag="deepgram-examples" present on Deepgram API call (segmenter.py:106)
  • ✅ No hardcoded credentials
  • ✅ Error handling: credential check + file existence check in main()
  • ✅ Tests import from src/ and call the example's actual functions
  • ✅ Transcript assertions use length/duration proportionality (min_chars = max(5, audio_duration_sec * 2)) — no flaky word lists
  • ✅ Credential check runs before SDK operations
  • ℹ️ Minor: torchaudio imported but unused in segmenter.py:24 (silero_vad handles loading via read_audio). Non-blocking.

Documentation

  • ✅ README has "What you'll build", env vars table with console links, install/run instructions, CLI options, and "How it works" breakdown
  • ✅ .env.example present and complete

✓ All checks pass. Ready for merge.


Review by Lead on 2026-04-10

@github-actions
Contributor Author

Code Review

Overall: APPROVED

Tests ran ✅

── detect_speech_regions ──
  Found 4 speech region(s)
    0.07s - 2.69s
    2.79s - 4.32s
    4.51s - 12.54s
    12.71s - 25.41s
  ✓ detect_speech_regions finds speech in real audio

── extract_segment_bytes ──
  ✓ extract_segment_bytes produces valid WAV

── process_audio (end-to-end) ──
  Detected 4 speech region(s). Transcribing...
  Transcribed 4 segment(s), 331 total chars
  ✓ End-to-end pipeline transcribes real speech correctly

── VAD parameters ──
  threshold=0.5 → 4 regions, threshold=0.9 → 8 regions
  ✓ VAD parameters affect segmentation output

Results: 4 passed, 0 failed

Integration genuineness

Pass — Silero VAD SDK is imported and used for real speech detection (get_speech_timestamps, load_silero_vad, read_audio). Audio flows through Silero VAD for segmentation, then each detected speech region is transcribed via the Deepgram SDK. Silero VAD is a local processing library (no credentials needed), so .env.example correctly lists only DEEPGRAM_API_KEY. No bypass — Deepgram is used for STT, Silero for VAD, each in its proper role.

Code quality

  • Official Deepgram SDK deepgram-sdk==6.1.1 (correct pinned version)
  • tag="deepgram-examples" present on all Deepgram API calls
  • No hardcoded credentials
  • Error handling covers missing API key and missing audio file
  • Tests import from src/ and call the example's actual code (detect_speech_regions, extract_segment_bytes, process_audio)
  • Credential check runs first (exit 2) before SDK imports
  • Transcript assertions use length/duration proportionality, not word lists

Documentation

  • README includes "What you'll build", env vars table with console link, install/run instructions, CLI options, and how-it-works section
  • .env.example present and complete

✓ All checks pass. Ready for merge.


Review by Lead on 2026-04-11

@github-actions
Contributor Author

@deepgram/devrel — VP escalation: this PR has status:review-passed but no E2E check has ever run. The lead-e2e workflow does not support workflow_dispatch, so it cannot be manually re-triggered.

State: labels=type:example,status:review-passed,language:python,integration:silero-vad; no checks have run.

Last activity: 2026-04-11T01:02:13Z

@github-actions
Contributor Author

Code Review

Overall: CHANGES REQUESTED

Tests ran ✅

tests/test_example.py::test_detect_speech_regions PASSED
tests/test_example.py::test_extract_segment_bytes PASSED
tests/test_example.py::test_process_audio_end_to_end PASSED
tests/test_example.py::test_vad_parameters_affect_output PASSED

── detect_speech_regions ──
  Found 4 speech region(s)
    0.07s - 2.69s
    2.79s - 4.32s
    4.51s - 12.54s
    12.71s - 25.41s
  ✓ detect_speech_regions finds speech in real audio

── extract_segment_bytes ──
  ✓ extract_segment_bytes produces valid WAV

── process_audio (end-to-end) ──
  Transcribed 4 segment(s), 331 total chars
    [0.1s-2.7s] conf=1.00
    [2.8s-4.3s] conf=0.53
    [4.5s-12.5s] conf=1.00
    [12.7s-25.4s] conf=1.00
  ✓ End-to-end pipeline transcribes real speech correctly

── VAD parameters ──
  threshold=0.5 → 4 regions, threshold=0.9 → 8 regions
  ✓ VAD parameters affect segmentation output

Results: 4 passed, 0 failed

Integration genuineness ✅

  • ✅ Silero VAD SDK imported and used (load_silero_vad, get_speech_timestamps, read_audio)
  • ✅ Real VAD processing happens locally (not mocked)
  • ✅ Deepgram SDK used for transcription via client.listen.v1.media.transcribe_file()
  • ✅ No bypass — Silero VAD is a local preprocessing library, not a Deepgram wrapper; using DeepgramClient directly is correct here
  • ✅ No raw WebSocket/fetch calls to Deepgram API

Code quality

  • deepgram-sdk==6.1.1 — correct version
  • tag="deepgram-examples" present on Deepgram API call
  • ✅ No hardcoded credentials
  • ✅ Error handling for missing API key and missing files
  • ✅ Tests import from src/ and call actual example code
  • ✅ Credential check runs FIRST (exit 2) before SDK imports
  • ✅ Transcript assertions use length/duration proportionality (not word lists)
  • ⚠️ Minor: unused import torchaudio on line 24 of src/segmenter.py. torchaudio is a runtime dependency of silero_vad but is not directly used in the source file; please remove the unused import.

Documentation ✅

  • ✅ README includes "What you'll build", env vars with console links, install/run instructions, CLI options, and how-it-works
  • .env.example present and complete

One minor fix needed: remove the unused import torchaudio from src/segmenter.py:24. Everything else looks great — tests pass with real credentials, integration is genuine, and code quality is solid.

Please address the item above. The fix agent will pick this up.


Review by Lead on 2026-04-11

@github-actions github-actions bot added status:fix-needed Tests failing — fix agent queued and removed status:review-passed Self-review passed labels Apr 11, 2026
@github-actions
Contributor Author

Code Review

Overall: APPROVED

Tests ran ✅

tests/test_example.py::test_detect_speech_regions PASSED
tests/test_example.py::test_extract_segment_bytes PASSED
tests/test_example.py::test_process_audio_end_to_end PASSED
tests/test_example.py::test_vad_parameters_affect_output PASSED

4 passed, 12 warnings in 6.28s

Integration genuineness

All 6 checks pass:

  1. Silero VAD SDK imported and used (get_speech_timestamps, load_silero_vad, read_audio)
  2. Real VAD processing call made on actual audio — not mocked
  3. .env.example lists DEEPGRAM_API_KEY (Silero VAD is a local library — no cloud credentials needed)
  4. Tests exit with code 2 if credentials are missing
  5. BYPASS CHECK — N/A: Silero VAD is a local audio processing library, not a Deepgram speech interface. Using DeepgramClient directly for transcription is the correct pattern here.
  6. NO RAW PROTOCOL CHECK — no raw WebSocket/fetch calls; official Deepgram SDK used throughout

Code quality

  • ✅ Official Deepgram SDK used (deepgram-sdk==6.1.1 — current required version)
  • tag="deepgram-examples" present on Deepgram API call (segmenter.py:106)
  • ✅ No hardcoded credentials
  • ✅ Error handling: credential check + file existence check in main()
  • ✅ Tests import from src/ and call the example's actual functions (detect_speech_regions, extract_segment_bytes, process_audio)
  • ✅ Transcript assertions use length/duration proportionality (audio_duration_sec * 2 min chars) — not word lists
  • ✅ Credential check runs first in tests (exit 2 before any SDK import)
  • ⚠️ Minor: import torchaudio on line 24 of segmenter.py is unused (silero_vad depends on it at runtime but the source file doesn't use it directly). Cosmetic only — does not affect functionality.
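The "credential check runs first" pattern flagged above is worth spelling out: exiting with code 2 before any SDK import lets the harness distinguish "missing key" from a genuine failure. A minimal sketch, assuming a hypothetical require_env helper (not the example's literal code):

```python
# Illustrative sketch of the credential-check-first pattern: exit with a
# distinct status code (2) before any heavy SDK import, so a missing key
# is reported as a skippable condition rather than an import-time crash.
import os
import sys

def require_env(name: str) -> str:
    """Return the env var's value, or exit with status 2 if it is unset."""
    value = os.environ.get(name, "").strip()
    if not value:
        print(f"Missing required environment variable: {name}", file=sys.stderr)
        sys.exit(2)
    return value

# In the tests this would run before `import deepgram` / `import silero_vad`:
# DEEPGRAM_API_KEY = require_env("DEEPGRAM_API_KEY")
```

Running the check before the SDK imports also keeps the error message clean when dependencies are installed but credentials are not.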

Documentation

  • ✅ README: "What you'll build" section, env vars table with console link, install & run instructions, CLI options, key parameters, how-it-works explanation
  • .env.example present and complete

✓ All checks pass. Ready for merge.


Review by Lead on 2026-04-11

@github-actions github-actions bot added status:review-passed Self-review passed and removed status:fix-needed Tests failing — fix agent queued labels Apr 11, 2026
@github-actions
Contributor Author

Code Review

Overall: APPROVED

Tests ran ✅

tests/test_example.py::test_detect_speech_regions PASSED
tests/test_example.py::test_extract_segment_bytes PASSED
tests/test_example.py::test_process_audio_end_to_end PASSED
tests/test_example.py::test_vad_parameters_affect_output PASSED

4 passed, 12 warnings in 6.76s

── detect_speech_regions ──
  Found 4 speech region(s)
    0.07s - 2.69s | 2.79s - 4.32s | 4.51s - 12.54s | 12.71s - 25.41s
  ✓ detect_speech_regions finds speech in real audio

── extract_segment_bytes ──
  ✓ extract_segment_bytes produces valid WAV

── process_audio (end-to-end) ──
  Transcribed 4 segment(s), 331 total chars
    [0.1s-2.7s] conf=1.00 'Yeah. As as much as, it's worth'
    [2.8s-4.3s] conf=0.53 'Celebrating'
    [4.5s-12.5s] conf=1.00 'The first, spacewalk, with an all female team...'
    [12.7s-25.4s] conf=1.00 'And, I think if it signifies anything, it is...'
  ✓ End-to-end pipeline transcribes real speech correctly

── VAD parameters ──
  threshold=0.5 → 4 regions, threshold=0.9 → 8 regions
  ✓ VAD parameters affect segmentation output
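The counter-intuitive result above (a *higher* threshold producing *more* regions) follows from how frame probabilities are grouped: frames whose speech probability dips below the threshold split one long region into several shorter ones. A simplified, self-contained model of that grouping (an assumption for illustration, not Silero's actual internals):

```python
# Illustrative toy model (not Silero's implementation): group consecutive
# frames whose speech probability meets the threshold into (start, end)
# spans. Raising the threshold lets mid-region dips split long regions,
# which can increase the region count.
def regions_from_probs(probs, threshold, frame_sec=0.032):
    """Group consecutive frames with prob >= threshold into (start, end) spans."""
    regions, start = [], None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i  # region opens
        elif p < threshold and start is not None:
            regions.append((start * frame_sec, i * frame_sec))  # region closes
            start = None
    if start is not None:
        regions.append((start * frame_sec, len(probs) * frame_sec))
    return regions

probs = [0.95, 0.95, 0.7, 0.95, 0.1, 0.95, 0.95]
len(regions_from_probs(probs, 0.5))  # 2 regions
len(regions_from_probs(probs, 0.9))  # 3 regions: the 0.7 dip splits one
```

This mirrors the test log: at threshold=0.9 the borderline frames inside long utterances fall below the cut, so 4 regions become 8.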

Integration genuineness

Pass — All 6 checks satisfied:

  1. Silero VAD SDK imported and used (silero_vad.read_audio, load_silero_vad, get_speech_timestamps)
  2. Real VAD API calls made — not mocked or hardcoded
  3. .env.example complete (Silero VAD is a local model, no API key required — only DEEPGRAM_API_KEY needed)
  4. Tests exit with code 2 if credentials are missing
  5. Bypass check: Silero VAD is a local audio pre-processing library (voice activity detection), not a speech transcription wrapper. Audio flows through Silero VAD for segmentation, then each segment goes to Deepgram for STT — correct architecture
  6. No raw protocol: All Deepgram API calls go through the official SDK (client.listen.v1.media.transcribe_file)

Code quality

  • ✅ Official Deepgram SDK: deepgram-sdk==6.1.1 (matches required version)
  • tag="deepgram-examples" present on all Deepgram API calls
  • ✅ No hardcoded credentials
  • ✅ Error handling: credential check and file existence check in main()
  • ✅ Tests import from src/ and call the example's actual code (from segmenter import ...)
  • ✅ Transcript assertions use length/duration proportionality (min_chars = max(5, audio_duration_sec * 2)) — no specific word lists
  • ✅ Credential check runs first in tests (lines 13–22) before SDK imports
  • ⚠️ Minor: import torchaudio in segmenter.py is unused (silero_vad handles it internally). Non-blocking — can be cleaned up later.

Documentation

  • ✅ README has "What you'll build" section
  • ✅ All env vars listed with where-to-get links
  • ✅ Install and run instructions present
  • ✅ CLI options documented
  • .env.example present and complete

✓ All checks pass. Ready for merge.


Review by Lead on 2026-04-11

@lukeocodes lukeocodes merged commit 9407453 into main Apr 11, 2026
@lukeocodes lukeocodes deleted the example/530-silero-vad-speech-segmentation-python branch April 11, 2026 18:21

Labels

  • integration:silero-vad (Integration: Silero VAD)
  • language:python (Language: Python)
  • status:review-passed (Self-review passed)
  • type:example (New example)
