Pretrained ngram language models
A pretrained 1gram language model is included in this repository at language_model/pretrained_language_models/openwebtext_1gram_lm_sil. Pretrained 3gram and 5gram language models are available for download here (languageModel.tar.gz and languageModel_5gram.tar.gz) and should likewise be placed in the language_model/pretrained_language_models/ directory. Note that the 3gram model requires ~60 GB of RAM and the 5gram model requires ~300 GB of RAM. In addition, OPT 6.7b requires a GPU with roughly 12.4 GB or more of VRAM to load for inference.
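Because the larger models fail late and slowly when memory runs out, it can help to check the machine first. The following is a minimal sketch, not part of this repository: the function names are hypothetical, the RAM figures are the approximate ones quoted above, and the `os.sysconf` lookup assumes a Linux-like system.

```python
import os

# Approximate RAM needs in GB, taken from the figures quoted above.
# No figure is given for the 1gram model, so it is treated as having no minimum.
REQUIRED_RAM_GB = {3: 60, 5: 300}


def total_ram_gb():
    """Return total physical memory in GB (assumes a Linux-like sysconf)."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    page_count = os.sysconf("SC_PHYS_PAGES")
    return page_size * page_count / 1e9


def has_enough_ram(order, available_gb):
    """True if available_gb meets the quoted requirement for this ngram order."""
    return available_gb >= REQUIRED_RAM_GB.get(order, 0)


if __name__ == "__main__":
    ram = total_ram_gb()
    for order in (3, 5):
        status = "ok" if has_enough_ram(order, ram) else "insufficient"
        print(f"{order}gram: need ~{REQUIRED_RAM_GB[order]} GB, "
              f"have {ram:.0f} GB -> {status}")
```

This only compares total installed memory against the quoted figures; memory already in use by other processes is not accounted for.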
Dependencies
CMake >= 3.14
gcc >= 10.1
pytorch == 1.13.1
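To install gcc and CMake on Ubuntu, run the command below. build-essential provides gcc and make; CMake is a separate package. Packaged versions vary by Ubuntu release, so check them against the minimums above:
sudo apt-get install build-essential cmake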
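To confirm that the installed tools meet the minimum versions listed above, something like the following could be used. It is a rough sketch, not part of this repository: the helper names are hypothetical, and it assumes that `cmake --version` and `gcc --version` print a dotted version number somewhere in their output.

```python
import re
import subprocess

# Minimum versions from the dependency list above.
MINIMUMS = {"cmake": (3, 14), "gcc": (10, 1)}


def parse_version(text):
    """Extract the first dotted version number in text as a tuple of ints."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(tool, version_output):
    """Compare the parsed version against the minimum for this tool."""
    return parse_version(version_output) >= MINIMUMS[tool]


if __name__ == "__main__":
    for tool in MINIMUMS:
        try:
            output = subprocess.run(
                [tool, "--version"], capture_output=True, text=True, check=True
            ).stdout
            ok = meets_minimum(tool, output)
        except (FileNotFoundError, subprocess.CalledProcessError, ValueError):
            print(f"{tool}: not found or version unreadable")
            continue
        print(f"{tool}: {'ok' if ok else 'too old'}")
```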
Install language model python package
Use the setup_lm.sh script in the root directory of this repository to create the b2txt_lm conda env and install the lm-decoder package into it. Before installing, make sure that there is no build or fc_base directory in your language_model/runtime/server/x86 directory, as leftovers from a previous build may cause the new build to fail.
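The pre-install check described above can be automated. The sketch below is illustrative only and is not part of this repository; the function name is hypothetical, and it only reports leftover directories rather than deleting them.

```python
from pathlib import Path

# Directory named in the instructions above, relative to the repository root.
X86_DIR = Path("language_model/runtime/server/x86")
STALE_NAMES = ("build", "fc_base")


def find_stale_dirs(repo_root="."):
    """Return any leftover build or fc_base directories under the x86 directory."""
    base = Path(repo_root) / X86_DIR
    return [base / name for name in STALE_NAMES if (base / name).is_dir()]


if __name__ == "__main__":
    stale = find_stale_dirs()
    if stale:
        print("Remove these directories before running setup_lm.sh:")
        for path in stale:
            print(f"  {path}")
    else:
        print("No stale build directories found.")
```

Run it from the root directory of the repository, or pass the repository path to `find_stale_dirs`.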
Using a pretrained ngram language model
The language-model-standalone.py script included here is designed to work with the evaluate_model.py script in the model_training directory. When run, language-model-standalone.py does the following:
- Initialize opt-6.7b on the specified gpu (--gpu_number arg). The first time you run the script, it will automatically download opt-6.7b from huggingface.
- Initialize the ngram language model (specified with the --lm_path arg).
- Connect to the localhost redis server (or a different server, specified by the --redis_ip and --redis_port args).
- Wait to receive phoneme logits via redis, and then make word predictions and pass them back via redis.
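To make the command-line interface easier to picture, here is a hypothetical `argparse` sketch that mirrors the flags named in this README. It is not the script's actual parser: the types and defaults are assumptions (the numeric defaults simply reuse the values from the example commands in this README, and 6379 is the standard redis port).

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags named in this README."""
    parser = argparse.ArgumentParser(description="ngram LM sketch")
    parser.add_argument("--lm_path", required=True,
                        help="directory of the ngram language model")
    parser.add_argument("--gpu_number", type=int, default=0,
                        help="GPU on which to load opt-6.7b")
    parser.add_argument("--redis_ip", default="localhost")
    # 6379 is redis's standard port; the script's own default is an assumption.
    parser.add_argument("--redis_port", type=int, default=6379)
    parser.add_argument("--do_opt", action="store_true")
    parser.add_argument("--rescore", action="store_true")
    # Defaults below reuse the values from the README's example commands.
    parser.add_argument("--nbest", type=int, default=100)
    parser.add_argument("--acoustic_scale", type=float, default=0.325)
    parser.add_argument("--blank_penalty", type=float, default=90.0)
    parser.add_argument("--alpha", type=float, default=0.55)
    return parser


# Example: parse an explicit argument list instead of sys.argv.
example_args = build_parser().parse_args([
    "--lm_path",
    "language_model/pretrained_language_models/openwebtext_1gram_lm_sil",
    "--do_opt",
    "--gpu_number", "0",
])
print(example_args.redis_ip, example_args.redis_port, example_args.do_opt)
```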
To run the 1gram language model from the root directory of this repository:
conda activate b2txt_lm
python language_model/language-model-standalone.py --lm_path language_model/pretrained_language_models/openwebtext_1gram_lm_sil --do_opt --nbest 100 --acoustic_scale 0.325 --blank_penalty 90 --alpha 0.55 --redis_ip localhost --gpu_number 0
To run the 3gram language model from the root directory of this repository (requires ~60GB of RAM):
conda activate b2txt_lm
python language_model/language-model-standalone.py --lm_path language_model/pretrained_language_models/openwebtext_3gram_lm_sil --do_opt --nbest 100 --acoustic_scale 0.325 --blank_penalty 90 --alpha 0.55 --redis_ip localhost --gpu_number 0
To run the 5gram language model from the root directory of this repository (requires ~300GB of RAM):
conda activate b2txt_lm
python language_model/language-model-standalone.py --lm_path language_model/pretrained_language_models/openwebtext_5gram_lm_sil --rescore --do_opt --nbest 100 --acoustic_scale 0.325 --blank_penalty 90 --alpha 0.55 --redis_ip localhost --gpu_number 0
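The three commands above differ only in the model path and in the --rescore flag, which appears only in the 5gram command. A small helper can generate them; this is a sketch with a hypothetical function name, assuming the directory naming pattern shown in the commands above.

```python
import shlex

LM_DIR = "language_model/pretrained_language_models"
SCRIPT = "language_model/language-model-standalone.py"


def build_command(order, redis_ip="localhost", gpu_number=0):
    """Build the argument list for the 1gram, 3gram, or 5gram command."""
    if order not in (1, 3, 5):
        raise ValueError("order must be 1, 3, or 5")
    cmd = ["python", SCRIPT,
           "--lm_path", f"{LM_DIR}/openwebtext_{order}gram_lm_sil"]
    if order == 5:
        # Only the 5gram command in this README passes --rescore.
        cmd.append("--rescore")
    cmd += ["--do_opt", "--nbest", "100", "--acoustic_scale", "0.325",
            "--blank_penalty", "90", "--alpha", "0.55",
            "--redis_ip", redis_ip, "--gpu_number", str(gpu_number)]
    return cmd


if __name__ == "__main__":
    for order in (1, 3, 5):
        print(shlex.join(build_command(order)))
```

The returned list can be passed directly to `subprocess.run` from the root directory of the repository, with the b2txt_lm conda env active.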
Build a new phoneme-to-words ngram language model from scratch
- First, build binaries for building the language model:
  - Build SRILM:
    cd srilm-1.7.3
    export SRILM=$PWD
    make MAKE_PIC=yes World
    make cleanest
    export PATH=$PATH:$PWD/bin/i686-m64
  - Build openfst and other stuff:
    cd runtime/server/x86
    mkdir build
    cd build
    cmake ..
    make -j8
- Build ngram LM:
  cd ./examples/speech/s0/
  run.sh output_dir dict_path train_corpus sil_prob formatted_train_corpus prune_threshold order
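Since the SRILM step adds bin/i686-m64 to PATH, a quick check that the SRILM tools are actually visible can save a failed run. The sketch below is illustrative and not part of this repository: `ngram-count` and `ngram` are standard SRILM binaries, but whether run.sh calls exactly these tools is an assumption.

```python
import shutil

# Standard SRILM command-line tools; that run.sh needs exactly these
# is an assumption made for illustration.
SRILM_TOOLS = ("ngram-count", "ngram")


def missing_tools(tools=SRILM_TOOLS):
    """Return the tools from the list that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
        print("Check that the SRILM bin directory was added to PATH.")
    else:
        print("SRILM tools found on PATH.")
```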