From 2bbd0d0523217e15492ba8dde6c1fd9fe42ed044 Mon Sep 17 00:00:00 2001
From: tscizzlebg <54290732+tscizzlebg@users.noreply.github.com>
Date: Wed, 2 Jul 2025 15:14:17 -0700
Subject: [PATCH 1/2] Update README.md

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 421d84d..cb90c2f 100644
--- a/README.md
+++ b/README.md
@@ -48,7 +48,7 @@ Please download these datasets from [Dryad](https://datadryad.org/stash/dataset/
   sudo apt-get update
   sudo apt-get install redis
   ```
-  - Turn of autorestarting for the redis server in terminal:
+  - Turn off autorestarting for the redis server in terminal:
     - `sudo systemctl disable redis-server`
 
 - `CMake >= 3.14` and `gcc >= 10.1` are required for the ngram language model installation. You can install these on linux with `sudo apt-get install build-essential`.
@@ -64,4 +64,4 @@ We use an ngram language model plus rescoring via the [Facebook OPT 6.7b](https:
 Our Kaldi-based ngram implementation requires a different version of torch than our model training pipeline, so running the ngram language models requires an additional seperate python conda environment. To create this conda environment, run the following command from the root directory of this repository. For more detailed instructions, see the README.md in the `language_model` subdirectory.
 ```bash
 ./setup_lm.sh
-```
\ No newline at end of file
+```

From 6dc3a1445b422f6e8f5da3d9dd2dfd00d30e02bf Mon Sep 17 00:00:00 2001
From: Tyler
Date: Wed, 2 Jul 2025 16:42:00 -0700
Subject: [PATCH 2/2] README and script messages

---
 README.md   | 4 ++++
 setup.sh    | 6 +++++-
 setup_lm.sh | 6 +++++-
 3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index cb90c2f..cdd4c56 100644
--- a/README.md
+++ b/README.md
@@ -58,6 +58,8 @@ To create a conda environment with the necessary dependencies, run the following
 ./setup.sh
 ```
 
+Verify it worked by activating the conda environment with the command `conda activate b2txt25`.
+
 ## Python environment setup for ngram language model and OPT rescoring
 
 We use an ngram language model plus rescoring via the [Facebook OPT 6.7b](https://huggingface.co/facebook/opt-6.7b) LLM. A pretrained 1gram language model is included in this repository at `language_model/pretrained_language_models/openwebtext_1gram_lm_sil`. Pretrained 3gram and 5gram language models are available for download [here](https://datadryad.org/dataset/doi:10.5061/dryad.x69p8czpq) (`languageModel.tar.gz` and `languageModel_5gram.tar.gz`). Note that the 3gram model requires ~60GB of RAM, and the 5gram model requires ~300GB of RAM. Furthermore, OPT 6.7b requires a GPU with at least ~12.4 GB of VRAM to load for inference.
@@ -65,3 +67,5 @@ Our Kaldi-based ngram implementation requires a different version of torch than
 ```bash
 ./setup_lm.sh
 ```
+
+Verify it worked by activating the conda environment with the command `conda activate b2txt25_lm`.
diff --git a/setup.sh b/setup.sh
index d01e6ea..c8b84b6 100755
--- a/setup.sh
+++ b/setup.sh
@@ -34,4 +34,8 @@ pip install \
     transformers==4.53.0 \
     tokenizers==0.21.2 \
     accelerate==1.8.1 \
-    bitsandbytes==0.46.0
\ No newline at end of file
+    bitsandbytes==0.46.0
+
+echo
+echo "Setup complete! Verify it worked by activating the conda environment with the command 'conda activate b2txt25'."
+echo
diff --git a/setup_lm.sh b/setup_lm.sh
index 0bbdb10..17f109c 100755
--- a/setup_lm.sh
+++ b/setup_lm.sh
@@ -53,4 +53,8 @@ cd language_model/runtime/server/x86
 python setup.py install
 
 # cd back to the root directory
-cd ../../../..
\ No newline at end of file
+cd ../../../..
+
+echo
+echo "Setup complete! Verify it worked by activating the conda environment with the command 'conda activate b2txt25_lm'."
+echo
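With both patches applied, the end-to-end setup flow described by the updated README and scripts looks roughly like the sketch below. It assumes the two setup scripts are run from the repository root, as the README instructs; the conda environment names come from the verification messages added in these patches.

```bash
# Build the model-training environment, then confirm it was created.
./setup.sh
conda activate b2txt25

# Build the separate ngram / OPT rescoring environment, then confirm it was created.
./setup_lm.sh
conda activate b2txt25_lm
```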