Pretrained ngram language models
A pretrained 1gram language model is included in this repository at language_model/pretrained_language_models/openwebtext_1gram_lm_sil. Pretrained 3gram and 5gram language models are available for download here (languageModel.tar.gz and languageModel_5gram.tar.gz) and should likewise be placed in the language_model/pretrained_language_models/ directory. Note that the 3gram model requires ~60 GB of RAM, and the 5gram model requires ~300 GB of RAM. Furthermore, OPT 6.7b requires a GPU with at least ~12.4 GB of VRAM to load for inference.
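Because the RAM requirements differ so much between models, it can help to check the machine before downloading anything. The following is a minimal sketch, not part of this repository: the helper names are hypothetical, the thresholds are the approximate figures quoted above, and it reports total (not free) RAM on Linux-like systems.

```python
import os

# Approximate RAM requirements in GB, as stated above.
# The 1gram model's footprint is not stated, so it is used as the fallback.
NGRAM_RAM_GB = {"3gram": 60, "5gram": 300}


def total_ram_gb():
    """Total physical RAM in GB (relies on os.sysconf, so Linux/Unix only)."""
    return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE") / 1e9


def largest_model_that_fits(ram_gb):
    """Return the largest ngram model whose stated RAM need is <= ram_gb."""
    fitting = [name for name, need in NGRAM_RAM_GB.items() if need <= ram_gb]
    return max(fitting, key=NGRAM_RAM_GB.get) if fitting else "1gram"


if __name__ == "__main__":
    ram = total_ram_gb()
    print(f"{ram:.0f} GB RAM detected; suggested model: {largest_model_that_fits(ram)}")
```

Other processes also use memory, so treat the suggestion as an upper bound rather than a guarantee.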
Dependencies
CMake >= 3.14
gcc >= 10.1
pytorch == 1.13.1
To install CMake and gcc on Ubuntu, run:
sudo apt-get install build-essential cmake
Install language model python package
Use the setup_lm.sh script in the root directory of this repository to create the b2txt_lm conda env and install the lm-decoder package to it. Before installing, make sure that there is no build or fc_base directory in your language_model/runtime/server/x86 directory, as this may cause the build to fail.
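The check for leftover directories can be scripted. This is a small sketch under the assumption that it is run from the repository root; the function name is hypothetical, and it only reports the directories rather than deleting them.

```python
from pathlib import Path


def stale_build_dirs(x86_dir="language_model/runtime/server/x86"):
    """Return any leftover build or fc_base directories found under x86_dir."""
    root = Path(x86_dir)
    return [root / name for name in ("build", "fc_base") if (root / name).is_dir()]


if __name__ == "__main__":
    for path in stale_build_dirs():
        print(f"Remove before running setup_lm.sh: {path}")
```

If the script prints nothing, neither directory is present.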
Using a pretrained ngram language model
The language-model-standalone.py script included here is made to work with the evaluate_model.py script in the model_training directory. When run, language-model-standalone.py will do the following:
- Initialize opt-6.7b on the specified GPU (--gpu_number arg). The first time you run the script, it will automatically download opt-6.7b from Hugging Face.
- Initialize the ngram language model (specified with the --lm_path arg).
- Connect to the localhost redis server (or a different server, specified by the --redis_ip and --redis_port args).
- Wait to receive phoneme logits via redis, then make word predictions and pass them back via redis.
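Since the script depends on a running redis server, a quick reachability check before launching can save a long model load that ends in a connection error. The sketch below uses only the standard library and is not part of this repository; the function name is hypothetical, and port 6379 is redis's usual default, assumed here because the script's own default is not stated. It confirms only that something accepts TCP connections at that address, not that it is redis.

```python
import socket


def redis_reachable(host="localhost", port=6379, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    print("redis reachable" if redis_reachable() else "redis NOT reachable")
```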
To run the 1gram language model from the root directory of this repository:
conda activate b2txt_lm
python language_model/language-model-standalone.py --lm_path language_model/pretrained_language_models/openwebtext_1gram_lm_sil --do_opt --nbest 100 --acoustic_scale 0.325 --blank_penalty 90 --alpha 0.55 --redis_ip localhost --gpu_number 0
To run the 3gram language model from the root directory of this repository (requires ~60GB RAM):
conda activate b2txt_lm
python language_model/language-model-standalone.py --lm_path language_model/pretrained_language_models/openwebtext_3gram_lm_sil --do_opt --nbest 100 --acoustic_scale 0.325 --blank_penalty 90 --alpha 0.55 --redis_ip localhost --gpu_number 0
To run the 5gram language model from the root directory of this repository (requires ~300GB of RAM):
conda activate b2txt_lm
python language_model/language-model-standalone.py --lm_path language_model/pretrained_language_models/openwebtext_5gram_lm_sil --rescore --do_opt --nbest 100 --acoustic_scale 0.325 --blank_penalty 90 --alpha 0.55 --redis_ip localhost --gpu_number 0
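The three commands above differ only in the model path and in the --rescore flag, which appears only for the 5gram model. A hypothetical helper (not part of this repository) that assembles the same command lines might look like this; the flag values are copied from the commands above.

```python
import shlex

MODEL_DIR = "language_model/pretrained_language_models"


def build_lm_command(order, gpu_number=0, redis_ip="localhost"):
    """Build the argument list for the 1gram, 3gram, or 5gram command shown above."""
    if order not in (1, 3, 5):
        raise ValueError("order must be 1, 3, or 5")
    cmd = [
        "python", "language_model/language-model-standalone.py",
        "--lm_path", f"{MODEL_DIR}/openwebtext_{order}gram_lm_sil",
    ]
    if order == 5:
        cmd.append("--rescore")  # only the 5gram command above passes --rescore
    cmd += [
        "--do_opt", "--nbest", "100",
        "--acoustic_scale", "0.325", "--blank_penalty", "90", "--alpha", "0.55",
        "--redis_ip", redis_ip, "--gpu_number", str(gpu_number),
    ]
    return cmd


if __name__ == "__main__":
    # Print the 5gram command; pass the list to subprocess.run to execute it.
    print(shlex.join(build_lm_command(5)))
```

Run it from the repository root with the b2txt_lm conda env active, as with the commands above.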
Build a new phoneme-to-words ngram language model from scratch
- First, build binaries for building the language model:
  - Build SRILM:
    cd srilm-1.7.3
    export SRILM=$PWD
    make MAKE_PIC=yes World
    make cleanest
    export PATH=$PATH:$PWD/bin/i686-m64
  - Build openfst and other stuff:
    cd runtime/server/x86
    mkdir build
    cd build
    cmake ..
    make -j8
- Build ngram LM:
  cd ./examples/speech/s0/
  run.sh output_dir dict_path train_corpus sil_prob formatted_train_corpus prune_threshold order