Commit 3191027

Audio: MFCC: Add Python script for speech to text with Whisper model
Add sof_mel_to_text_live_dsp_vad.py that captures mel spectrogram frames from ALSA with embedded DSP VAD flag and performs live speech-to-text transcription using OpenVINO Whisper. The script buffers mel frames during speech and triggers Whisper inference when silence is detected after speech. Capture runs continuously in a separate thread during inference to avoid frame drops.

Replace the old README.txt with a comprehensive README.md that documents the MFCC tuning tools, testbench usage with run_mfcc.sh, output file formats, Matlab/Octave decode and plotting scripts, and the new live transcription workflow.

Signed-off-by: Seppo Ingalsuo <seppo.ingalsuo@linux.intel.com>
1 parent 0908aed commit 3191027

3 files changed

Lines changed: 569 additions & 52 deletions

src/audio/mfcc/tune/README.md

Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,98 @@
# SOF MFCC Tuning Tools

This directory contains a tool that creates the configuration blob for the
SOF MFCC component. Run it in Matlab or Octave with the command
`setup_mfcc`. The MFCC configuration parameters can be edited in the
script.

## Testbench

The configuration can be test run with the testbench. First, create the
test topologies with `scripts/build-tools.sh -t`. Next, build the
testbench with `scripts/rebuild-testbench.sh`.

Once the previous steps are done, a sample wav file can be processed
with the script `run_mfcc.sh`. The script converts the input to raw 16 kHz
stereo format and runs the testbench for S16, S24, and S32 bit depths,
producing both cepstral coefficient (MFCC) and Mel spectrogram outputs.

```
./run_mfcc.sh /usr/share/sounds/alsa/Front_Center.wav
```

Output files from host testbench:

| File | Content |
|------|---------|
| `mfcc_s16.raw`, `mfcc_s24.raw`, `mfcc_s32.raw` | Cepstral coefficients |
| `mel_s16.raw`, `mel_s24.raw`, `mel_s32.raw` | Mel spectrogram |

If the `XTENSA_PATH` environment variable is set, the script also runs
the Xtensa build of the testbench (via `xt-run`) and produces additional
output files prefixed with `xt_`:

| File | Content |
|------|---------|
| `xt_mfcc_s16.raw`, `xt_mfcc_s24.raw`, `xt_mfcc_s32.raw` | Cepstral coefficients |
| `xt_mel_s16.raw`, `xt_mel_s24.raw`, `xt_mel_s32.raw` | Mel spectrogram |

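As a quick sanity check after `run_mfcc.sh`, the output names listed in the tables above can be enumerated and tested for existence. The following is a minimal Python sketch and not part of the tuning tools. The helper names `expected_outputs` and `missing_outputs` are hypothetical, and the sketch assumes the files land in the directory that is passed in.

```python
import os
from pathlib import Path

FORMATS = ("s16", "s24", "s32")   # bit depths run by the testbench
KINDS = ("mfcc", "mel")           # cepstral coefficients, Mel spectrogram


def expected_outputs(with_xtensa: bool) -> list[str]:
    """Return the raw file names from the tables above (hypothetical helper)."""
    prefixes = ("", "xt_") if with_xtensa else ("",)
    return [f"{p}{k}_{f}.raw" for p in prefixes for k in KINDS for f in FORMATS]


def missing_outputs(directory: str, with_xtensa: bool) -> list[str]:
    """List expected files that are not present in the given directory."""
    base = Path(directory)
    return [n for n in expected_outputs(with_xtensa) if not (base / n).is_file()]


if __name__ == "__main__":
    # The xt_ files are only expected when XTENSA_PATH is set.
    xtensa = bool(os.environ.get("XTENSA_PATH"))
    for name in missing_outputs(".", xtensa):
        print("missing:", name)
```

Running it in the directory where the testbench wrote its output prints one line per absent file and nothing when all files are present.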
## Decoding and Plotting

All output files can be decoded and plotted at once in Matlab or Octave
with the `decode_all.m` script:

```matlab
decode_all
```

This calls `decode_ceps` for each MFCC file (13 cepstral coefficients) and
`decode_mel` for each Mel file (80 Mel bins), plotting spectrograms for all
files that exist, including the Xtensa variants.

Individual files can also be decoded manually:

```matlab
[ceps, t, n] = decode_ceps('mfcc_s16.raw', 13);
```

The value 13 above comes from the configuration script, which sets up MFCC
to output 13 cepstral coefficients from each FFT → Mel → DCT → cepstral
coefficients computation run.

The 80-band Mel output can be decoded and visualized with the command:

```matlab
[mel, t, n] = decode_mel('mel_s16.raw', 80);
```

## Live Whisper Transcription with DSP VAD

The directory contains a Python script `sof_mel_to_text_live_dsp_vad.py`.
It can be used with the development topologies
`sof-arl-cs42l43-l0-cs35l56-l23-mfcc.tplg` and
`sof-mtl-rt713-l0-rt1316-l12-mfcc.tplg`. By default it captures Mel audio
features and VAD flags from the audio device `hw:0,47` (headset microphone).
Captured frames with detected speech are sent to the Whisper speech
recognition model for conversion to text.

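The buffering behaviour described for the script (collect Mel frames while the VAD flag is set, run inference once silence follows speech, keep capturing in a separate thread) can be sketched as below. This is an illustration under stated assumptions, not the code of `sof_mel_to_text_live_dsp_vad.py`: the names `SpeechSegmenter`, `run_pipeline`, and the `hangover_frames` parameter are hypothetical, the frame source stands in for ALSA capture, and `transcribe` stands in for Whisper inference.

```python
import queue
import threading
from typing import Callable, Iterable, List, Optional, Tuple

Frame = List[float]  # one Mel frame; a stand-in for the real frame data


class SpeechSegmenter:
    """Buffer frames flagged as speech and emit them once silence follows."""

    def __init__(self, hangover_frames: int = 3):
        # Hypothetical knob: number of consecutive silent frames that
        # ends an utterance. Shorter pauses do not split the segment.
        self.hangover_frames = hangover_frames
        self._buffer: List[Frame] = []
        self._silent_run = 0

    def push(self, frame: Frame, vad: bool) -> Optional[List[Frame]]:
        if vad:
            self._buffer.append(frame)
            self._silent_run = 0
            return None
        if not self._buffer:
            return None  # silence before any speech, nothing buffered
        self._silent_run += 1
        if self._silent_run < self.hangover_frames:
            return None
        segment, self._buffer = self._buffer, []
        self._silent_run = 0
        return segment


def run_pipeline(source: Iterable[Tuple[Frame, bool]],
                 transcribe: Callable[[List[Frame]], str],
                 hangover_frames: int = 3) -> List[str]:
    """Capture in a worker thread so slow inference does not stall capture."""
    frames: "queue.Queue[Optional[Tuple[Frame, bool]]]" = queue.Queue()

    def capture() -> None:
        for item in source:   # stands in for reading frames from ALSA
            frames.put(item)
        frames.put(None)      # end-of-stream marker used only by this sketch

    worker = threading.Thread(target=capture, daemon=True)
    worker.start()

    segmenter = SpeechSegmenter(hangover_frames)
    texts: List[str] = []
    while (item := frames.get()) is not None:
        segment = segmenter.push(*item)
        if segment is not None:
            texts.append(transcribe(segment))  # stands in for Whisper
    worker.join()
    return texts


if __name__ == "__main__":
    stream = [([0.0], False), ([1.0], True), ([2.0], True),
              ([0.0], False), ([0.0], False), ([3.0], True),
              ([0.0], False), ([0.0], False)]
    print(run_pipeline(stream, lambda seg: f"{len(seg)} frames", 2))
```

The queue decouples the two rates: the capture thread only enqueues, so frames that arrive during a long inference call wait in the queue instead of being dropped. In this sketch silent frames are not kept in the segment, and speech still buffered at end of stream is discarded.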
### Prerequisites

The script needs OpenVINO. Please follow the install procedure from
<https://docs.openvino.ai/2025/get-started/install-openvino.html>.

The following Python packages also need to be installed with pip into the
same OpenVINO venv:

```bash
pip install openvino openvino-tokenizers openvino-genai
pip install optimum[intel]
pip install transformers
pip install huggingface_hub
```

### NPU / GPU Support

By default the script runs the Whisper encoder model on the NPU. To
use the NPU, install the driver from
<https://github.com/intel/linux-npu-driver/releases>. If the NPU is not
available, switch the encoder to the CPU with the option `--encoder-device CPU`.
With a GPU, both `--encoder-device GPU` and `--decoder-device GPU` can be set.
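
To find out which value to pass to `--encoder-device`, the devices that OpenVINO reports can be listed first. The sketch below is a hedged illustration and not taken from the script: `pick_device` is a hypothetical helper, and the only OpenVINO call used is the `available_devices` property of `openvino.Core`. How the script itself handles a missing NPU is not shown in this commit view.

```python
from typing import Iterable


def pick_device(preferred: str, available: Iterable[str],
                fallback: str = "CPU") -> str:
    """Return the preferred device if it is reported, else the fallback.

    OpenVINO may report numbered instances such as "GPU.0", so the
    comparison uses the part of each name before the first dot.
    """
    families = {name.split(".")[0] for name in available}
    return preferred if preferred in families else fallback


if __name__ == "__main__":
    try:
        import openvino as ov
        available = ov.Core().available_devices
    except ImportError:
        # Assumption for this sketch: without OpenVINO, report CPU only.
        available = ["CPU"]
    print("available devices:", available)
    print("suggested option: --encoder-device", pick_device("NPU", available))
```

If the NPU driver is not installed, the suggestion falls back to `CPU`, which matches the manual workaround described above.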

src/audio/mfcc/tune/README.txt

Lines changed: 0 additions & 52 deletions
This file was deleted.
