Recently I had to run some simple audio decoding with Kaldi. But to my dismay, documentation was unintuitive and lacking! I reckoned the demand to convert some audio files to text with pretrained Kaldi models would be huge -- and that various tutorials would exist -- but such was not the case.

As far as I could find, these two (1, 2) tutorials were the only posts (other than the official docs) offering a "Quickstart" guide to trainless decoding with Kaldi. But neither reference provides an exact recipe for building a simple audio-in-text-out pipeline.

Maybe that's a good thing. Maybe that's da Kaldi way...

Model of choice

At the time of writing, the most up-to-date model on the Kaldi models page is the Librispeech ASR model. We'll use the wsj egg, because that seems to be the standard when using the Librispeech model.

'Tis a guilty pleasure of mine, calling examples from the egs directory eggs.


Initially we have convert_me.wav, the audio to transcribe.

Through Kaldi we want to obtain transcript.txt, containing the ASR output.

These are the steps required:

  1. Download Kaldi Docker image and pretrained Librispeech model.
  2. Create wav.scp and utt2spk.
  3. Create spk2utt.
  4. Copy files to hi_res directory and validate.
  5. Extract MFCC features.
  6. Extract i-vectors.
  7. Create decoding graph with small language model.
  8. Rescore with large (or medium) language model.
  9. Extract transcription.

Does decoding a single little .wav file have to be this complicated, you ask? Yes. Dats da Kaldi way.

Let's go through the steps one by one. Most code in this writeup is a remix of code found in the aforementioned tutorials.

1. Download Kaldi Docker image and pretrained Librispeech model.

Running Kaldi Docker
$ docker run -it kaldiasr/kaldi:latest bash # Run Kaldi image.
$ cd egs/wsj/s5 # Go to wsj model directory.

We'll assume you've put convert_me.wav at /opt/kaldi/egs/wsj/s5. For this you might want to mount a volume (through which you communicate with the outside world) with -v when you run the Docker container.
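As a rough sketch of the -v approach, the helper below (run on the host, not inside the container) starts the same image with a host directory mounted in. The function name kaldi_shell and the mount point /mnt/audio are made up for this example; they are not part of Kaldi or the image.

```shell
# Start the Kaldi container with a host directory mounted inside it.
#   $1: host directory that holds convert_me.wav (use an absolute path)
# /mnt/audio is a hypothetical mount point chosen for this sketch.
kaldi_shell() {
    docker run -it -v "$1":/mnt/audio kaldiasr/kaldi:latest bash
}
```

You would call it as kaldi_shell "$PWD/audio" and then, inside the container, copy the file over with cp /mnt/audio/convert_me.wav /opt/kaldi/egs/wsj/s5/. Mounting straight onto /opt/kaldi/egs/wsj/s5 is best avoided, since a bind mount hides whatever the image already has at that path.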

Every command after this will be executed at /opt/kaldi/egs/wsj/s5 within the Kaldi Docker instance.

Create a directory called data, and under it create a directory that contains all information pertaining to a single decoding job. A "single decoding job" could mean decoding a single file or multiple files. In this instance, we'll name our job convert_me, and our job will consist of decoding a single file, convert_me.wav.

$ mkdir -p data/convert_me

Download and prepare the pretrained model
# From
$ wget 
$ tar -xzf 0013_librispeech_s5.tar.gz # Unzip model.
$ cp -r 0013_librispeech_v1/data/lang_test* data/ # Copy language models.
$ cp -r 0013_librispeech_v1/exp . # Copy chain model and i-vector extractor.

2. Create wav.scp and utt2spk manually.

Create data/convert_me/wav.scp.

wav.scp lists all audio files to decode in a single job, along with their recording ids.

Each pair of recording id and file path is formatted as "recording_id file_path" and is listed on its own line.
For example, if we assign an id of utt1 to our convert_me.wav, our wav.scp would be:

utt1 /opt/kaldi/egs/wsj/s5/convert_me.wav

The id and path fields are separated by a space.
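If a job covers more than one file, writing wav.scp by hand gets tedious. Here is a minimal sketch that generates it from a directory of .wav files. The function name make_wav_scp is hypothetical, and using the file name (minus .wav) as the recording id is a choice made for this example, not a Kaldi requirement. It assumes file names contain no spaces.

```shell
# Write one "recording_id file_path" line per .wav file in $1 into $2/wav.scp.
#   $1: directory with the audio files (pass an absolute path)
#   $2: job directory, e.g. data/convert_me
make_wav_scp() {
    audio_dir=$1
    data_dir=$2
    mkdir -p "$data_dir"
    for wav in "$audio_dir"/*.wav; do
        [ -e "$wav" ] || continue    # glob matched nothing
        # Recording id = file name without the .wav extension.
        printf '%s %s\n' "$(basename "$wav" .wav)" "$wav"
    done | LC_ALL=C sort > "$data_dir/wav.scp"    # Kaldi expects C-locale sort order
}
```

For the single-file case in this post, make_wav_scp /opt/kaldi/egs/wsj/s5 data/convert_me would produce one line with the id convert_me instead of utt1.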

Next, create a file named data/convert_me/utt2spk.

utt2spk maps each utterance id to a speaker id. Since we don't care about designating different speaker ids in this example, we tell Kaldi the whole audio file is a single utterance, and just set the speaker id to the utterance id (utt1). Since we are not using a segments file, the utterance id, speaker id, and recording id (from wav.scp) must all be the same.

The contents of data/convert_me/utt2spk are as follows:

utt1 utt1
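Because every id is the same here, utt2spk can be derived from wav.scp instead of typed out. A small sketch, assuming the "one speaker per recording" setup described above; the function name make_utt2spk is made up for this example.

```shell
# Derive utt2spk from wav.scp by reusing each recording id
# as both the utterance id and the speaker id.
#   $1: job directory containing wav.scp, e.g. data/convert_me
make_utt2spk() {
    awk '{ print $1, $1 }' "$1/wav.scp" > "$1/utt2spk"
}
```

Running make_utt2spk data/convert_me on the wav.scp shown earlier gives the same single line, utt1 utt1.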

3. Create spk2utt from utt2spk

spk2utt is the opposite of utt2spk. The file maps speaker ids to utterance ids. This file is created automatically via a script:

$ utils/ data/convert_me/utt2spk > data/convert_me/spk2utt

4. Copy decoding metadata to another directory, and validate.

$ utils/ data/convert_me data/convert_me_hires
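If validation complains, a quick sanity check of your hand-written files can narrow things down. The sketch below only tests the one rule stated earlier (without a segments file, wav.scp and utt2spk list the same ids); it is far less thorough than Kaldi's own validation, and the function name check_data_dir is made up for this example.

```shell
# Succeed only if wav.scp and utt2spk both exist in $1, are non-empty,
# and list the same ids in the same order.
check_data_dir() {
    d=$1
    for f in wav.scp utt2spk; do
        [ -s "$d/$f" ] || { echo "missing or empty: $d/$f" >&2; return 1; }
    done
    if [ "$(cut -d' ' -f1 "$d/wav.scp")" != "$(cut -d' ' -f1 "$d/utt2spk")" ]; then
        echo "ids in wav.scp and utt2spk differ" >&2
        return 1
    fi
}
```

Usage would be check_data_dir data/convert_me && echo ok.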

5. Extract MFCC features.

Set appropriate environment variables.

$ export train_cmd=""
$ export decode_cmd=" --mem 2G"

The commands above initialize the variables $train_cmd and $decode_cmd that we'll use later.

Next, extract MFCC features from our audio file:

$ steps/ \
    --nj 1 \
    --mfcc-config conf/mfcc_hires.conf \
    --cmd "$train_cmd" data/convert_me_hires
$ steps/ data/convert_me_hires
$ utils/ data/convert_me_hires

The --nj argument specifies the number of split jobs. Since we only have a single entry in our wav.scp, our only option for --nj is 1. For --mfcc-config, we use the provided configuration.

6. Extract i-vectors.

$ nspk=$(wc -l <data/convert_me_hires/spk2utt)
$ steps/online/nnet2/  \
    --cmd "$train_cmd" --nj "${nspk}" \
    data/convert_me_hires exp/nnet3_cleaned/extractor \

7. Create decoding graph with small language model.

Create decoding graph:

$ export dir=exp/chain_cleaned/tdnn_1d_sp
$ export graph_dir=$dir/graph_tgsmall
$ utils/ --self-loop-scale 1.0 --remove-oov \
    data/lang_test_tgsmall $dir $graph_dir

Decode using smaller language model:

$ steps/nnet3/ --acwt 1.0 --post-decode-acwt 10.0 \
    --nj 1 --cmd "$decode_cmd" \
    --online-ivector-dir exp/nnet3_cleaned/ivectors_convert_me_hires \
    $graph_dir data/convert_me_hires $dir/decode_convert_me_tgsmall

8. Rescore with large (or medium) language model.

Set skip_scoring to true in line 10 of steps/

Then, run rescoring with the large language model:

$ steps/ --cmd "$decode_cmd" data/lang_test_{tgsmall,tglarge} \
    data/convert_me_hires $dir/decode_convert_me_{tgsmall,tglarge}

All decoding is done! Now we just need to extract the transcribed text.

9. Extract transcription.

$ steps/ data/convert_me exp/chain_cleaned/tdnn_1d_sp/graph_tgsmall \

The resulting transcript can be found at exp/chain_cleaned/tdnn_1d_sp/decode_convert_me_tglarge/score_20/convert_me.ctm.


Last Update: July 04, 2024