How to Decode with Pretrained Kaldi Models
All I want is to decode this small little wav..
Motivation
Recently I had to run some simple audio decoding with Kaldi. But to my dismay, the documentation was unintuitive and lacking! I reckoned the demand for converting audio files to text with pretrained Kaldi models would be huge -- and that various tutorials would exist -- but such was not the case.
As far as I could find, these two (1, 2) tutorials were the only posts (other than the official docs) offering a "Quickstart" guide to trainless decoding with Kaldi. But neither reference provides an exact solution for building a simple audio-in, text-out pipeline.
Maybe that's a good thing. Maybe that's da Kaldi way..
Model of choice
At the time of writing, the most up-to-date model on the Kaldi models page is the Librispeech ASR model. So we'll use the wsj egg, because that seems to be the standard when using the Librispeech model. 'Tis a guilty pleasure of mine, calling examples from the egs directory eggs.
Overview
Initially we have convert_me.wav, the audio to transcribe. Through Kaldi we want to obtain transcript.txt, containing the ASR output.
These are the steps required:
- Download Kaldi Docker image and pretrained Librispeech model.
- Create wav.scp and utt2spk.
- Create spk2utt.
- Copy files to a hires directory and validate.
- Extract MFCC features.
- Extract i-vectors.
- Create decoding graph with small language model.
- Rescore with large (or medium) language model.
- Extract transcription.
Does decoding a single little .wav file have to be this complicated, you ask? Yes. Dats da Kaldi way.
Let's go through the steps one by one. Most code in this writeup is a remix of code found in the aforementioned tutorials.
1. Download Kaldi Docker image and pretrained Librispeech model.
Running Kaldi Docker
$ docker run -it kaldiasr/kaldi:latest bash # Run Kaldi image.
$ cd egs/wsj/s5 # Go to wsj model directory.
We'll assume you've put convert_me.wav at /opt/kaldi/egs/wsj/s5. Every command after this will be executed at /opt/kaldi/egs/wsj/s5 within the Kaldi Docker instance. To get your audio into the container in the first place, you might want to mount a volume (through which you communicate with the outside world) with -v when you run the Docker container.
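For example, mounting a host directory might look like this (the host path /home/me/audio and container path /opt/audio are hypothetical; adjust to your setup):
$ docker run -it -v /home/me/audio:/opt/audio kaldiasr/kaldi:latest bash # Mount host directory.
$ cp /opt/audio/convert_me.wav /opt/kaldi/egs/wsj/s5/ # Copy the wav into the egg.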
Create a directory called data, and under it create a directory that contains all information pertaining to a single decoding job. A "single decoding job" could mean decoding one file or multiple files. In this instance, we'll name our job convert_me, and it will involve decoding a single file, convert_me.wav.
$ mkdir -p data/convert_me
Download and prepare the pretrained model
# From http://kaldi-asr.org/models.html.
$ wget http://kaldi-asr.org/models/13/0013_librispeech_s5.tar.gz
$ tar -xzf 0013_librispeech_s5.tar.gz # Unzip model.
$ cp -r 0013_librispeech_v1/data/lang_test* data/ # Copy language models.
$ cp -r 0013_librispeech_v1/exp . # Copy chain model and i-vector extractor.
2. Create wav.scp and utt2spk manually.
Create data/convert_me/wav.scp.
wav.scp lists all audio files to decode in a single job, along with their recording ids. Each pair of recording id and file path is formatted as "recording_id file_path", one pair per line. For example, if we assign an id of utt1 to our convert_me.wav, our wav.scp would be:
utt1 /opt/kaldi/egs/wsj/s5/convert_me.wav
The id and path fields are separated by a space.
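If you had multiple files to decode in one job, wav.scp would simply list one recording per line (utt2 and its path are hypothetical here):
utt1 /opt/kaldi/egs/wsj/s5/convert_me.wav
utt2 /opt/kaldi/egs/wsj/s5/convert_me_too.wav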
Next, create a file named data/convert_me/utt2spk.
utt2spk maps each utterance id to a speaker id. Since we don't care about designating different speaker ids in this example, we tell Kaldi the whole audio file is a single utterance and just set the speaker id to the utterance id (utt1). Since we are not using a segments file, the utterance id, speaker id, and recording id (from step 2) must all be the same.
The contents of data/convert_me/utt2spk are as follows:
utt1 utt1
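For the record, both files can be created straight from the shell:
$ echo "utt1 /opt/kaldi/egs/wsj/s5/convert_me.wav" > data/convert_me/wav.scp
$ echo "utt1 utt1" > data/convert_me/utt2spk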
3. Create spk2utt from utt2spk
spk2utt is the opposite of utt2spk: it maps speaker ids to utterance ids. This file is created automatically via a script:
$ utils/utt2spk_to_spk2utt.pl data/convert_me/utt2spk > data/convert_me/spk2utt
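Since our utt2spk contains a single "utt1 utt1" line, the generated spk2utt comes out identical:
utt1 utt1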
4. Copy decoding metadata to another directory, and validate.
The chain model we downloaded expects high-resolution MFCC features, which is why the data directory gets Kaldi's conventional _hires suffix:
$ utils/copy_data_dir.sh data/convert_me data/convert_me_hires
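copy_data_dir.sh runs a quick validation of its own at the end, but you can also check explicitly. The --no-feats and --no-text flags tell the validator not to expect features or transcripts, since we have neither yet:
$ utils/validate_data_dir.sh --no-feats --no-text data/convert_me_hires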
5. Extract MFCC features.
Set appropriate environment variables.
$ export train_cmd="run.pl"
$ export decode_cmd="run.pl --mem 2G"
The commands above initialize the variables $train_cmd and $decode_cmd that we'll use later.
Next, extract MFCC features from our audio file:
$ steps/make_mfcc.sh \
--nj 1 \
--mfcc-config conf/mfcc_hires.conf \
--cmd "$train_cmd" data/convert_me_hires
$ steps/compute_cmvn_stats.sh data/convert_me_hires
$ utils/fix_data_dir.sh data/convert_me_hires
The --nj argument specifies the number of parallel jobs to split the work into. Since we only have a single entry in our wav.scp, our only option for --nj is 1. We use the provided high-resolution configuration for --mfcc-config.
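As a quick sanity check, you can ask Kaldi for the dimensionality of the extracted features; with conf/mfcc_hires.conf it should report 40 (high-resolution MFCCs). Note that calling Kaldi binaries directly requires sourcing path.sh first:
$ . ./path.sh # Put Kaldi binaries on PATH.
$ feat-to-dim scp:data/convert_me_hires/feats.scp -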
6. Extract i-vectors.
The chain model also takes online i-vectors as an auxiliary input, so we extract them for our audio as well. --nj is set to the number of speakers, which is one in our case:
$ nspk=$(wc -l <data/convert_me_hires/spk2utt)
$ steps/online/nnet2/extract_ivectors_online.sh \
--cmd "$train_cmd" --nj "${nspk}" \
data/convert_me_hires exp/nnet3_cleaned/extractor \
exp/nnet3_cleaned/ivectors_convert_me_hires
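If the step succeeded, the output directory should now contain ivector_online.scp, which points at the extracted i-vectors:
$ ls exp/nnet3_cleaned/ivectors_convert_me_hires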
7. Create decoding graph with small language model.
Create decoding graph:
$ export dir=exp/chain_cleaned/tdnn_1d_sp
$ export graph_dir=$dir/graph_tgsmall
$ utils/mkgraph.sh --self-loop-scale 1.0 --remove-oov \
data/lang_test_tgsmall $dir $graph_dir
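mkgraph.sh composes the final decoding graph. The resulting directory should contain, among other things, HCLG.fst (the graph itself) and words.txt (the output symbol table):
$ ls $graph_dir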
Decode using the smaller language model (--acwt 1.0 together with --post-decode-acwt 10.0 is the standard acoustic-weight setting for chain models):
$ steps/nnet3/decode.sh --acwt 1.0 --post-decode-acwt 10.0 \
--nj 1 --cmd "$decode_cmd" \
--online-ivector-dir exp/nnet3_cleaned/ivectors_convert_me_hires \
$graph_dir data/convert_me_hires $dir/decode_convert_me_tgsmall
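The decode directory now holds the compressed lattices, one archive per job (lat.1.gz in our single-job case), plus logs under log/:
$ ls $dir/decode_convert_me_tgsmall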
8. Rescore with large (or medium) language model.
Set skip_scoring to true in line 10 of steps/lmrescore_const_arpa.sh.
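If you'd rather not edit the script by hand, a sed one-liner can flip the flag; this assumes the script declares skip_scoring=false on that line, so double-check before running:
$ sed -i 's/^skip_scoring=false/skip_scoring=true/' steps/lmrescore_const_arpa.sh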
Then, run rescoring with the large language model:
$ steps/lmrescore_const_arpa.sh --cmd "$decode_cmd" data/lang_test_{tgsmall,tglarge} \
data/convert_me_hires $dir/decode_convert_me_{tgsmall,tglarge}
All decoding is done! Now we just need to extract the transcribed text.
9. Extract transcription.
$ steps/get_ctm.sh data/convert_me exp/chain_cleaned/tdnn_1d_sp/graph_tgsmall \
exp/chain_cleaned/tdnn_1d_sp/decode_convert_me_tglarge
The resulting transcript can be found at exp/chain_cleaned/tdnn_1d_sp/decode_convert_me_tglarge/score_20/convert_me.ctm.
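A CTM line has the form "utterance_id channel start duration word", so pulling out the fifth column and joining the words gives us the transcript.txt we were after:
$ awk '{print $5}' \
    exp/chain_cleaned/tdnn_1d_sp/decode_convert_me_tglarge/score_20/convert_me.ctm \
    | tr '\n' ' ' > transcript.txt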