
Build model for Vosk #1690

Open · HassanTen opened this issue Dec 20, 2024 · 4 comments

@HassanTen

I want to create a speech recognition model for numbers from 0 to 13 in Arabic using Google Colab. Are these steps correct?

  1. Setting Up the Environment in Google Colab

Installing Basic Tools

First, we need to install essential tools such as build-essential, gfortran, and sox, which are required to build Kaldi and the Vosk training pipeline.

In the first Google Colab cell, run the following commands:

!apt-get install -y build-essential gfortran
!apt-get install -y sox
!apt-get install -y python3-pip
!pip install kaldi-python

Installing Kaldi

After installing the basic tools, we will clone and build Kaldi from GitHub:

!git clone https://github.com/kaldi-asr/kaldi.git
%cd kaldi/tools
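# (Optional) sanity check before make: Kaldi ships a dependency checker in tools/extras
!extras/check_dependencies.sh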
!make
%cd ../src
!./configure --use-cuda=no
!make

  2. Preparing Your Data (Numbers 0 to 13)

In this step, we need to upload audio files containing the numbers 0 to 13 and prepare the data in the required files.

Upload Audio Files (Numbers 0 to 13)

Upload your audio files to Google Colab. You can upload files through Colab's "Files" interface.

Example of file paths (upload the files under the audio/ folder):

audio/zero.wav
audio/one.wav
audio/two.wav
...
audio/thirteen.wav
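
Kaldi (and later Vosk) typically expects 16 kHz, mono, 16-bit PCM WAV files; the verification code later in this thread checks for exactly 16000 Hz mono. If your recordings differ, a quick conversion pass with the sox installed in step 1 might look like this (a minimal sketch; audio_16k/ is a hypothetical output folder):

!mkdir -p audio_16k
!for f in audio/*.wav; do sox "$f" -r 16000 -c 1 -b 16 "audio_16k/$(basename "$f")"; done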

Preparing the Required Files

We will create 3 main files: text, wav.scp, and utt2spk.

  1. The text File:

In this file, we write the utterance ID and the corresponding sentence. Each utterance ID will be unique, like speaker-0, speaker-1, and so on.

The content will look like this:

speaker-0 صفر
speaker-1 واحد
speaker-2 اثنان
speaker-3 ثلاثة
speaker-4 أربعة
speaker-5 خمسة
speaker-6 ستة
speaker-7 سبعة
speaker-8 ثمانية
speaker-9 تسعة
speaker-10 عشرة
speaker-11 أحد عشر
speaker-12 اثنا عشر
speaker-13 ثلاثة عشر

  2. The wav.scp File:

This file contains the path to each audio file along with the corresponding utterance-id. For example:

speaker-0 /path/to/audio/zero.wav
speaker-1 /path/to/audio/one.wav
speaker-2 /path/to/audio/two.wav
speaker-3 /path/to/audio/three.wav
...
speaker-13 /path/to/audio/thirteen.wav

  3. The utt2spk File:

In this file, we link each utterance-id to the speaker's name. In this case, it's always speaker.

speaker-0 speaker
speaker-1 speaker
speaker-2 speaker
speaker-3 speaker
...
speaker-13 speaker
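
Writing these three files by hand is error-prone. Here is a minimal Python sketch that generates all of them at once (the English file stems and the /content/audio path are assumptions based on the examples above):

import os

# File stems paired with their Arabic transcriptions, matching the text file above
words = [
    ("zero", "صفر"), ("one", "واحد"), ("two", "اثنان"), ("three", "ثلاثة"),
    ("four", "أربعة"), ("five", "خمسة"), ("six", "ستة"), ("seven", "سبعة"),
    ("eight", "ثمانية"), ("nine", "تسعة"), ("ten", "عشرة"),
    ("eleven", "أحد عشر"), ("twelve", "اثنا عشر"), ("thirteen", "ثلاثة عشر"),
]

os.makedirs("data/train", exist_ok=True)
with open("data/train/text", "w", encoding="utf-8") as text, \
     open("data/train/wav.scp", "w") as scp, \
     open("data/train/utt2spk", "w") as u2s:
    for i, (stem, arabic) in enumerate(words):
        utt = f"speaker-{i}"
        text.write(f"{utt} {arabic}\n")                   # utterance-id + transcript
        scp.write(f"{utt} /content/audio/{stem}.wav\n")   # utterance-id + wav path
        u2s.write(f"{utt} speaker\n")                     # utterance-id + speaker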

  3. Preparing the Lexicon and Language Model

Creating the lexicon.txt File

In this file, write each word followed by its phonemic transcription (here, the Arabic numbers 0 to 13). The phonemes should match how the numbers are actually pronounced in the audio files.

The content will look like this:

صفر s ˈf r
واحد w ʌ h i d
اثنان ʔ t h aː n
ثلاثة t h l aː t a
أربعة ʔ r b ʕ a
خمسة x aː m s a
ستة s i t t a
سبعة s aː b ʕ a
ثمانية t h m aː n iː a
تسعة t s ʕ a
عشرة ʕ ʃ a r a
أحد عشر ʔ h d ʔ aʃ a r
اثنا عشر ʔ t h n aʃ a r
ثلاثة عشر t l aʕ t aʃ a r

Creating nonsilence_phones.txt and silence_phones.txt

nonsilence_phones.txt: This contains all the non-silence phones. You can extract them from lexicon.txt:

cut -d ' ' -f 2- lexicon.txt | sed 's/ /\n/g' | sort -u > nonsilence_phones.txt

silence_phones.txt: This contains the silence phones. In this case, the content could be:

echo -e 'SIL\noov\nSPN' > silence_phones.txt
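
prepare_lang.sh in the next step expects all of these files together in a single dictionary directory, along with an optional_silence.txt (usually just SIL) and a lexicon entry for the out-of-vocabulary word. Assuming the files were created in the current directory, a minimal assembly could be:

!mkdir -p data/local/dict
!cp lexicon.txt nonsilence_phones.txt silence_phones.txt data/local/dict/
!echo 'SIL' > data/local/dict/optional_silence.txt
!echo '<UNK> SPN' >> data/local/dict/lexicon.txt   # OOV entry referenced by prepare_lang.sh below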

Preparing the Language Directory

The next step is preparing the data/lang directory by running prepare_lang.sh. The second argument is the out-of-vocabulary word, which must also have an entry in the lexicon (added above):

utils/prepare_lang.sh data/local/dict "<UNK>" data/local/lang data/lang

  4. Creating the Language Model

Creating corpus.txt

This file should contain all the sentences you want to use in your dataset. You can simply extract sentences from the text file by using a script to remove the utterance-id.
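
For example, stripping the utterance-id with cut (assuming the text file lives in data/train):

!cut -d ' ' -f 2- data/train/text > corpus.txt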

Installing SRILM for Language Model Creation

To install SRILM (note that SRILM is licensed software and the download normally requires registration on the SRI website):

!wget https://www.speech.sri.com/projects/srilm/srilm.tar.gz
!tar -xzvf srilm.tar.gz

Then, run the installer and source the environment script:

!./install_srilm.sh && source ./env.sh

After installing SRILM, run lm_creation.sh to create the language model:

./lm_creation.sh
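
lm_creation.sh is a recipe-specific script. If it is not available, an equivalent small trigram model can be built directly with SRILM's ngram-count and converted into the G.fst that Kaldi decodes with (the file names here are assumptions):

!ngram-count -text corpus.txt -order 3 -wbdiscount -lm lm.arpa
!arpa2fst --disambig-symbol=#0 --read-symbol-table=data/lang/words.txt lm.arpa data/lang/G.fst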

  5. Training the Model

Alignment of Data

Use the align_train.sh script to align the data:

./align_train.sh

Training the Model

After alignment, you can train the model using the run_tdnn_1j.sh script:

local/chain/tuning/run_tdnn_1j.sh

  6. Preparing the Final Model for Vosk

Preparing the Final Model

After training is complete, collect all the necessary files and prepare the model using the copy_final_result.sh script:

./copy_final_result.sh

Creating the model.conf File

You need to create the model.conf file to specify the model settings. For example:

--min-active=200
--max-active=3000
--beam=10.0
--lattice-beam=2.0
--acoustic-scale=1.0
--frame-subsampling-factor=3
--endpoint.silence-phones=1:2:3:4:5:6:7:8:9:10
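
For reference, a Vosk-compatible model directory generally follows the layout below (as in the models distributed by Alpha Cephei); copy_final_result.sh should assemble something similar:

model/
├── am/final.mdl
├── conf/mfcc.conf
├── conf/model.conf
├── graph/HCLG.fst
├── graph/words.txt
├── graph/phones/word_boundary.int
└── ivector/          (i-vector extractor files, if the model uses them)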

  7. Evaluation and Using the Model

Now you have a model that is fully compatible with Vosk and can be used to recognize the numbers from 0 to 13.

@nshmyrev
Collaborator

Looks ok

@HassanTen
Author

> Looks ok

I want to create a speech recognition model for numbers from 0 to 13 in Arabic using Google Colab.
Are there steps to install libraries and work on it?
Something like this: https://youtu.be/m-JzldXm9bQ?feature=shared

@nshmyrev
Collaborator

You can use our existing colab:

https://github.com/alphacep/vosk-api/blob/master/python/example/colab/vosk-training.ipynb

@HassanTen
Author

Are these steps correct after running vosk-training.ipynb?

Install required libraries

!pip install --upgrade g2p-en pandas soundfile

Import necessary libraries

import os
import shutil
import re
import soundfile as sf
import pandas as pd
from g2p_en import G2p
from google.colab import drive

Mount Google Drive

print(" Mounting Google Drive...")
try:
drive.mount('/content/drive')
except Exception as e:
print(f"⚠️ Error mounting Google Drive: {e}")
exit()

Clone Vosk recipes

print(" Cloning Vosk recipes...")
try:
!git clone https://github.com/alphacep/vosk-api.git
except Exception as e:
print(f"⚠️ Error cloning Vosk: {e}")
exit()

Prepare Vosk environment

try:
    %cd vosk-api/egs/wsj
    print("⚙️ Preparing environment...")
    !./run.sh --stop-stage 1
    %cd ../..
except Exception as e:
    print(f"⚠️ Error preparing Vosk environment: {e}")
    exit()

Define paths

ROOT_DIR = "/content/drive/MyDrive/custom_vosk"
DATA_DIR = "/content/custom_data"
WAV_DIR = os.path.join(DATA_DIR, "wav")
TRANSCRIPTS_CSV = os.path.join(ROOT_DIR, "transcripts.csv")
TRANSCRIPT_TXT = os.path.join(DATA_DIR, "transcript.txt")
LEXICON_TXT = os.path.join(DATA_DIR, "lexicon.txt")
FAILED_WORDS_TXT = os.path.join(DATA_DIR, "failed_words.txt")
MODEL_DIR = os.path.join(ROOT_DIR, "model")

Create directories

os.makedirs(WAV_DIR, exist_ok=True)
os.makedirs(DATA_DIR, exist_ok=True)
os.makedirs(MODEL_DIR, exist_ok=True)

Copy audio files

print(" Copying audio files...")
try:
wav_source_dir = os.path.join(ROOT_DIR, "wav")
for filename in os.listdir(wav_source_dir):
if filename.endswith(".wav"):
shutil.copy(os.path.join(wav_source_dir, filename), WAV_DIR)
except FileNotFoundError:
print(f"⚠️ Audio folder not found: {wav_source_dir}")
exit()
except Exception as e:
print(f"⚠️ Error copying audio files: {e}")
exit()

Read and process transcripts file

print(" Reading transcripts file...")
try:
transcripts_df = pd.read_csv(TRANSCRIPTS_CSV, header=None, names=['filename', 'transcript'])
except FileNotFoundError:
print(f"⚠️ Transcripts file not found at: {TRANSCRIPTS_CSV}")
exit()
except pd.errors.ParserError:
print(f"⚠️ Failed to parse transcripts file. Ensure it is in the correct format.")
exit()

Create processed transcripts file

print("✍️ Creating processed transcripts file...")
with open(TRANSCRIPT_TXT, "w", encoding="utf-8") as f:
for _, row in transcripts_df.iterrows():
f.write(f"{row['filename']} {row['transcript']}\n")

Text cleaning function

def clean_text(text):
    """Remove unwanted characters from text and normalize whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s'’‘-]", '', text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

Verify audio files format

print(" Verifying audio files format...")
audio_errors = []
for filename in os.listdir(WAV_DIR):
if filename.endswith(".wav"):
file_path = os.path.join(WAV_DIR, filename)
try:
data, samplerate = sf.read(file_path)
if samplerate != 16000:
audio_errors.append(f"⚠️ {filename}: Incorrect sample rate (should be 16000 Hz).")
if data.ndim != 1:
audio_errors.append(f"⚠️ {filename}: File is not mono.")
except Exception as e:
audio_errors.append(f"⚠️ {filename}: Error reading file - {e}")

if audio_errors:
print("\n".join(audio_errors))
print("⚠️ Please correct the above errors in the audio files.")
exit()

Prepare lexicon file

print(" Preparing lexicon file...")
g2p = G2p()
words_in_transcript = set()

Extract and clean words from transcripts

with open(TRANSCRIPT_TXT, "r", encoding="utf-8") as f:
    for line in f:
        words = line.strip().split()[1:]  # Ignore filenames
        cleaned_words = [clean_text(word) for word in words]
        words_in_transcript.update(cleaned_words)

Create lexicon file

with open(LEXICON_TXT, "w", encoding="utf-8") as f, open(FAILED_WORDS_TXT, "w", encoding="utf-8") as failed:
    for word in sorted(words_in_transcript):
        if word:
            try:
                pronunciation = " ".join(g2p(word))
                f.write(f"{word} {pronunciation}\n")
            except Exception as e:
                failed.write(f"{word}\n")
                print(f"⚠️ Error generating pronunciation for word '{word}': {e}")
print("✅ Lexicon file prepared.")

Train the model

print(" Starting model training...")
try:
%cd vosk-api/egs/wsj
!./run.sh --stage 2 --nj 1
!mkdir -p $MODEL_DIR
!mv exp/chain/tdnn $MODEL_DIR/custom_model
print(f"✅ Model trained successfully and saved at: {MODEL_DIR}")
except Exception as e:
print(f"⚠️ Error during model training: {e}")
exit()

Evaluation steps

print(" Model evaluation steps:")
print("1. Copy the model folder (custom_model) to your Vosk project.")
print("2. Use the following command-line command for speech recognition:")
print(" python test.py -i input.wav -m path/to/custom_model")
print("3. Test the model with new audio files to check accuracy.")
