How to Decode with Pretrained Kaldi Models
All I want is to decode this small little wav..
Check out daglo: the speech-to-text recorder that nails every word and whips up custom summaries. They pay me a (somewhat measly) salary, cover my conference trips, do awesome RAG, and once made me give up my window seat for “better Wi-Fi”, whatever that means. Check us out!
Motivation
Recently I had to run some simple audio decoding with Kaldi. But to my dismay, the documentation was unintuitive and lacking! I reckoned the demand for converting audio files to text with pretrained Kaldi models would be huge -- and that various tutorials would exist -- but such was not the case.
As far as I could find, these two (1, 2) tutorials were the only posts (other than the official docs) offering a "Quickstart" guide to trainless decoding with Kaldi. But neither reference provides an exact recipe for building a simple audio-in-text-out pipeline.
Maybe that's a good thing. Maybe that's da Kaldi way..
Model of choice
At the time of writing, the most up-to-date model on the Kaldi models page is the Librispeech ASR model. So we'll use the wsj egg, because that seems to be the standard when using the Librispeech model.
'Tis a guilty pleasure of mine, calling examples from the egs directory eggs.
Overview
Initially we have convert_me.wav, the audio to transcribe. Through Kaldi we want to obtain transcript.txt, containing the ASR output.
These are the steps required:
- Download Kaldi Docker image and pretrained Librispeech model.
- Create wav.scp and utt2spk.
- Create spk2utt.
- Copy files to hires directory and validate.
- Extract MFCC features.
- Extract i-vectors.
- Create decoding graph with small language model.
- Rescore with large (or medium) language model.
- Extract transcription.
Does decoding a single little .wav file have to be this complicated, you ask? Yes. Dats da Kaldi way.
Let's go through the steps one by one. Most code in this writeup is a remix of code found in the aforementioned tutorials.
1. Download Kaldi Docker image and pretrained Librispeech model.
Running Kaldi Docker
$ docker run -it kaldiasr/kaldi:latest bash # Run Kaldi image.
$ cd egs/wsj/s5 # Go to wsj model directory.
We'll assume you've put convert_me.wav at /opt/kaldi/egs/wsj/s5. For this you might want to mount a volume (through which you communicate with the outside world) with -v when you run the Docker container; a minimal sketch follows.
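For example (the host path is a placeholder -- substitute the directory that actually holds your audio):
$ docker run -it -v /path/on/host:/mnt/host kaldiasr/kaldi:latest bash # Mount a host directory into the container.
$ cp /mnt/host/convert_me.wav /opt/kaldi/egs/wsj/s5/ # Copy the audio into the recipe directory.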
Every command after this will be executed at /opt/kaldi/egs/wsj/s5 within the Kaldi Docker instance.
Create a directory called data, and under it create a directory that contains all information pertaining to a single decoding job. A "single decoding job" could mean decoding a single file or multiple files. In this instance, we'll name our job convert_me, and our job will involve decoding a single file, convert_me.wav.
$ mkdir -p data/convert_me
Download and prepare the pretrained model
# From http://kaldi-asr.org/models.html.
$ wget http://kaldi-asr.org/models/13/0013_librispeech_s5.tar.gz
$ tar -xzf 0013_librispeech_s5.tar.gz # Unzip model.
$ cp -r 0013_librispeech_v1/data/lang_test* data/ # Copy language models.
$ cp -r 0013_librispeech_v1/exp . # Copy chain model and i-vector extractor.
2. Create wav.scp and utt2spk manually.
Create data/convert_me/wav.scp.
wav.scp lists all audio files to decode in a single job, along with their recording ids. Each recording id and file path pair is formatted as "recording_id file_path" and listed on its own line. For example, if we assign an id of utt1 to our convert_me.wav, our wav.scp would be:
utt1 /opt/kaldi/egs/wsj/s5/convert_me.wav
The id and path fields are separated by a space.
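If you don't want to open an editor, a one-liner does the job (assuming the path above):
$ echo "utt1 /opt/kaldi/egs/wsj/s5/convert_me.wav" > data/convert_me/wav.scp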
Next, create a file named data/convert_me/utt2spk.
utt2spk maps each utterance id to a speaker id. Since we don't care about designating different speaker ids in this example, we tell Kaldi the whole audio file is a single utterance, and just set the speaker id to the utterance id (utt1). Since we are not using a segments file, the utterance id, speaker id, and recording id (from wav.scp) must all be the same.
The contents of data/convert_me/utt2spk are as follows:
utt1 utt1
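Again, a one-liner suffices:
$ echo "utt1 utt1" > data/convert_me/utt2spk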
3. Create spk2utt from utt2spk
spk2utt is the inverse of utt2spk: it maps speaker ids to utterance ids. This file is created automatically via a script:
$ utils/utt2spk_to_spk2utt.pl data/convert_me/utt2spk > data/convert_me/spk2utt
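Since we have a single utterance mapped to a single speaker, the generated data/convert_me/spk2utt simply contains the same mapping in reverse:
utt1 utt1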
4. Copy decoding metadata to another directory, and validate.
$ utils/copy_data_dir.sh data/convert_me data/convert_me_hires
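Optionally, you can sanity-check the copied directory with Kaldi's standard validation script; the flags below skip checks for files we haven't created yet:
$ utils/validate_data_dir.sh --no-feats --no-text data/convert_me_hires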
5. Extract MFCC features.
Set appropriate environment variables.
$ export train_cmd="run.pl"
$ export decode_cmd="run.pl --mem 2G"
The commands above initialize the variables $train_cmd and $decode_cmd that we'll use later.
Next, extract MFCC features from our audio file:
$ steps/make_mfcc.sh \
--nj 1 \
--mfcc-config conf/mfcc_hires.conf \
--cmd "$train_cmd" data/convert_me_hires
$ steps/compute_cmvn_stats.sh data/convert_me_hires
$ utils/fix_data_dir.sh data/convert_me_hires
The --nj argument specifies the number of split jobs. Since we only have a single entry in our wav.scp, our only option for --nj is 1. We use the provided configuration for --mfcc-config.
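As an optional sanity check, feat-to-dim (a standard Kaldi binary) prints the dimension of the extracted features, which should be 40 with the stock hires configuration:
$ feat-to-dim scp:data/convert_me_hires/feats.scp -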
6. Extract i-vectors.
$ nspk=$(wc -l <data/convert_me_hires/spk2utt) # Number of speakers (1 in our case), used as --nj below.
$ steps/online/nnet2/extract_ivectors_online.sh \
--cmd "$train_cmd" --nj "${nspk}" \
data/convert_me_hires exp/nnet3_cleaned/extractor \
exp/nnet3_cleaned/ivectors_convert_me_hires
7. Create decoding graph with small language model.
Create decoding graph:
$ export dir=exp/chain_cleaned/tdnn_1d_sp
$ export graph_dir=$dir/graph_tgsmall
$ utils/mkgraph.sh --self-loop-scale 1.0 --remove-oov \
data/lang_test_tgsmall $dir $graph_dir
Decode using the smaller language model:
$ steps/nnet3/decode.sh --acwt 1.0 --post-decode-acwt 10.0 \
--nj 1 --cmd "$decode_cmd" \
--online-ivector-dir exp/nnet3_cleaned/ivectors_convert_me_hires \
$graph_dir data/convert_me_hires $dir/decode_convert_me_tgsmall
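If decoding succeeds, the decode directory should contain a compressed lattice for our single job (lat.1.gz) along with logs:
$ ls exp/chain_cleaned/tdnn_1d_sp/decode_convert_me_tgsmall/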
8. Rescore with large (or medium) language model.
Set skip_scoring to true in line 10 of steps/lmrescore_const_arpa.sh.
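If you'd rather not edit the file by hand, a sed one-liner works too (this assumes the variable is declared as skip_scoring=false; check line 10 first):
$ sed -i 's/^skip_scoring=false/skip_scoring=true/' steps/lmrescore_const_arpa.sh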
Then, run large language model rescoring:
$ steps/lmrescore_const_arpa.sh --cmd "$decode_cmd" data/lang_test_{tgsmall,tglarge} \
data/convert_me_hires $dir/decode_convert_me_{tgsmall,tglarge}
All decoding is done! Now we just need to extract the transcribed text.
9. Extract transcription.
$ steps/get_ctm.sh data/convert_me exp/chain_cleaned/tdnn_1d_sp/graph_tgsmall \
exp/chain_cleaned/tdnn_1d_sp/decode_convert_me_tglarge
The resulting transcript can be found at exp/chain_cleaned/tdnn_1d_sp/decode_convert_me_tglarge/score_20/convert_me.ctm.
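Each line of the CTM file holds an utterance id, channel, start time, duration, and a recognized word. To get the flat transcript.txt we set out for, pulling the fifth column is enough (a small sketch; adjust the score_20 path if your scoring directory differs):
$ awk '{printf "%s ", $5} END {print ""}' \
exp/chain_cleaned/tdnn_1d_sp/decode_convert_me_tglarge/score_20/convert_me.ctm > transcript.txt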