How to Decode with Pretrained Kaldi Models
All I want is to decode this small little wav..
Motivation
Recently I had to run some simple audio decoding with Kaldi. But to my dismay, the documentation was unintuitive and lacking! I reckoned the demand for converting audio files to text with pretrained Kaldi models would be huge -- and that various tutorials would exist -- but such was not the case.
As far as I could find, these two (1, 2) tutorials were the only posts (other than the official docs) offering a "Quickstart" guide to trainless decoding with Kaldi. But neither reference provides an exact solution for building a simple audio-in, text-out pipeline.
Maybe that's a good thing. Maybe that's da Kaldi way..
Model of choice
At the time of writing, the most up-to-date model on the Kaldi models page is the Librispeech ASR model. So we'll use the wsj egg, because that seems to be the standard when using the Librispeech model. 'Tis a guilty pleasure of mine, calling examples from the egs directory eggs.
Overview
Initially we have convert_me.wav, the audio to transcribe. Through Kaldi we want to obtain transcript.txt, containing the ASR output.
These are the steps required:
- Download Kaldi Docker image and pretrained Librispeech model.
- Create wav.scp and utt2spk.
- Create spk2utt.
- Copy files to a hires directory and validate.
- Extract MFCC features.
- Extract i-vectors.
- Create decoding graph with small language model.
- Rescore with large (or medium) language model.
- Extract transcription.
Does decoding a single little .wav file have to be this complicated, you ask? Yes. Dats da Kaldi way.
Let's go through the steps one by one. Most code in this writeup is a remix of code found in the aforementioned tutorials.
1. Download Kaldi Docker image and pretrained Librispeech model.
Running Kaldi Docker
$ docker run -it kaldiasr/kaldi:latest bash # Run Kaldi image.
$ cd egs/wsj/s5 # Go to wsj model directory.
We'll assume you've put convert_me.wav at /opt/kaldi/egs/wsj/s5. Every command after this will be executed at /opt/kaldi/egs/wsj/s5 within the Kaldi Docker instance. To get your audio into the container in the first place, you might want to mount a volume (through which you communicate with the outside world) with -v when you run the Docker container.
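For example, mounting a host directory might look like this (the host path /home/me/audio and container path /opt/audio are hypothetical; adjust to your setup):
$ docker run -it -v /home/me/audio:/opt/audio kaldiasr/kaldi:latest bash # Mount host directory.
$ cp /opt/audio/convert_me.wav /opt/kaldi/egs/wsj/s5/ # Copy the wav into the egg.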
Create a directory called data, and under it create a directory that contains all information pertaining to a single decoding job. A "single decoding job" could mean decoding one file or multiple files. In this instance, we'll name our job convert_me, and it will involve decoding a single file, convert_me.wav.
$ mkdir -p data/convert_me
Download and prepare the pretrained model
# From http://kaldi-asr.org/models.html.
$ wget http://kaldi-asr.org/models/13/0013_librispeech_s5.tar.gz
$ tar -xzf 0013_librispeech_s5.tar.gz # Unzip model.
$ cp -r 0013_librispeech_v1/data/lang_test* data/ # Copy language models.
$ cp -r 0013_librispeech_v1/exp . # Copy chain model and i-vector extractor.
2. Create wav.scp and utt2spk manually.
Create data/convert_me/wav.scp.
wav.scp lists all audio files to decode in a single job, along with their recording ids. Each pair of recording id and file path is formatted as "recording_id file_path", one pair per line. For example, if we assign an id of utt1 to our convert_me.wav, our wav.scp would be:
utt1 /opt/kaldi/egs/wsj/s5/convert_me.wav
The id and path fields are separated by a space.
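If you had multiple files to decode in one job, wav.scp would simply list one recording per line (utt2 and its path are hypothetical here):
utt1 /opt/kaldi/egs/wsj/s5/convert_me.wav
utt2 /opt/kaldi/egs/wsj/s5/convert_me_too.wav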
Next, create a file named data/convert_me/utt2spk.
utt2spk maps each utterance id to a speaker id. Since we don't care about designating different speaker ids in this example, we tell Kaldi the whole audio file is a single utterance and just set the speaker id to the utterance id (utt1). Since we are not using a segments file, the utterance id, speaker id, and recording id (from step 2) must all be the same.
The contents of data/convert_me/utt2spk are as follows:
utt1 utt1
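For the record, both files can be created straight from the shell:
$ echo "utt1 /opt/kaldi/egs/wsj/s5/convert_me.wav" > data/convert_me/wav.scp
$ echo "utt1 utt1" > data/convert_me/utt2spk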
3. Create spk2utt from utt2spk
spk2utt is the opposite of utt2spk: it maps speaker ids to utterance ids. This file is created automatically via a script:
$ utils/utt2spk_to_spk2utt.pl data/convert_me/utt2spk > data/convert_me/spk2utt
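Since our utt2spk contains a single "utt1 utt1" line, the generated spk2utt comes out identical:
utt1 utt1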
4. Copy decoding metadata to another directory, and validate.
The chain model we downloaded expects high-resolution MFCC features, which is why the data directory gets Kaldi's conventional _hires suffix:
$ utils/copy_data_dir.sh data/convert_me data/convert_me_hires
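copy_data_dir.sh runs a quick validation of its own at the end, but you can also check explicitly. The --no-feats and --no-text flags tell the validator not to expect features or transcripts, since we have neither yet:
$ utils/validate_data_dir.sh --no-feats --no-text data/convert_me_hires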
5. Extract MFCC features.
Set appropriate environment variables.
$ export train_cmd="run.pl"
$ export decode_cmd="run.pl --mem 2G"
The commands above initialize the variables $train_cmd and $decode_cmd that we'll use later.
Next, extract MFCC features from our audio file:
$ steps/make_mfcc.sh \
--nj 1 \
--mfcc-config conf/mfcc_hires.conf \
--cmd "$train_cmd" data/convert_me_hires
$ steps/compute_cmvn_stats.sh data/convert_me_hires
$ utils/fix_data_dir.sh data/convert_me_hires
The --nj argument specifies the number of parallel jobs to split the work into. Since we only have a single entry in our wav.scp, our only option for --nj is 1. We use the provided high-resolution configuration for --mfcc-config.
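As a quick sanity check, you can ask Kaldi for the dimensionality of the extracted features; with conf/mfcc_hires.conf it should report 40 (high-resolution MFCCs). Note that calling Kaldi binaries directly requires sourcing path.sh first:
$ . ./path.sh # Put Kaldi binaries on PATH.
$ feat-to-dim scp:data/convert_me_hires/feats.scp -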
6. Extract i-vectors.
The chain model also takes online i-vectors as an auxiliary input, so we extract them for our audio as well. --nj is set to the number of speakers, which is one in our case:
$ nspk=$(wc -l <data/convert_me_hires/spk2utt)
$ steps/online/nnet2/extract_ivectors_online.sh \
--cmd "$train_cmd" --nj "${nspk}" \
data/convert_me_hires exp/nnet3_cleaned/extractor \
exp/nnet3_cleaned/ivectors_convert_me_hires
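If the step succeeded, the output directory should now contain ivector_online.scp, which points at the extracted i-vectors:
$ ls exp/nnet3_cleaned/ivectors_convert_me_hires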
7. Create decoding graph with small language model.
Create decoding graph:
$ export dir=exp/chain_cleaned/tdnn_1d_sp
$ export graph_dir=$dir/graph_tgsmall
$ utils/mkgraph.sh --self-loop-scale 1.0 --remove-oov \
data/lang_test_tgsmall $dir $graph_dir
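mkgraph.sh composes the final decoding graph. The resulting directory should contain, among other things, HCLG.fst (the graph itself) and words.txt (the output symbol table):
$ ls $graph_dir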
Decode using the smaller language model (--acwt 1.0 together with --post-decode-acwt 10.0 is the standard acoustic-weight setting for chain models):
$ steps/nnet3/decode.sh --acwt 1.0 --post-decode-acwt 10.0 \
--nj 1 --cmd "$decode_cmd" \
--online-ivector-dir exp/nnet3_cleaned/ivectors_convert_me_hires \
$graph_dir data/convert_me_hires $dir/decode_convert_me_tgsmall
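The decode directory now holds the compressed lattices, one archive per job (lat.1.gz in our single-job case), plus logs under log/:
$ ls $dir/decode_convert_me_tgsmall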
8. Rescore with large (or medium) language model.
Set skip_scoring to true in line 10 of steps/lmrescore_const_arpa.sh.
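If you'd rather not edit the script by hand, a sed one-liner can flip the flag; this assumes the script declares skip_scoring=false on that line, so double-check before running:
$ sed -i 's/^skip_scoring=false/skip_scoring=true/' steps/lmrescore_const_arpa.sh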
Then, run rescoring with the large language model:
$ steps/lmrescore_const_arpa.sh --cmd "$decode_cmd" data/lang_test_{tgsmall,tglarge} \
data/convert_me_hires $dir/decode_convert_me_{tgsmall,tglarge}
All decoding is done! Now we just need to extract the transcribed text.
9. Extract transcription.
$ steps/get_ctm.sh data/convert_me exp/chain_cleaned/tdnn_1d_sp/graph_tgsmall \
exp/chain_cleaned/tdnn_1d_sp/decode_convert_me_tglarge
The resulting transcript can be found at exp/chain_cleaned/tdnn_1d_sp/decode_convert_me_tglarge/score_20/convert_me.ctm.
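A CTM line has the form "utterance_id channel start duration word", so pulling out the fifth column and joining the words gives us the transcript.txt we were after:
$ awk '{print $5}' \
    exp/chain_cleaned/tdnn_1d_sp/decode_convert_me_tglarge/score_20/convert_me.ctm \
    | tr '\n' ' ' > transcript.txt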