SRI Object-oriented Language Modeling Libraries and Tools

BUILD AND INSTALL

See the INSTALL file in the top-level directory.

FILES

All mixed-case .cc and .h files contain C++ code for the liboolm
library.
All lower-case .cc files in this directory correspond to executable commands.
Each program is option driven. The -help option prints a list of
available options and their meanings.
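
For example, to list the options accepted by ngram-count:

    ngram-count -help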

N-GRAM MODELS

Here we only cover arbitrary-order n-grams (no classes yet, sorry).

ngram-count    manipulates n-gram counts and estimates backoff models
ngram-merge    merges count files (only needed for large
               corpora/vocabularies)
ngram          computes probabilities given a backoff model
               (also does mixing of backoff models)

Below are some typical command lines.

NGRAM COUNT MANIPULATION

    ngram-count -order 4 -text corpus -write corpus.ngrams

Counts all n-grams up to length 4 in corpus and writes them to
corpus.ngrams. The default n-gram order (if not specified) is 3.

    ngram-count -vocab 30k.vocab -order 4 -text corpus -write corpus.ngrams

Same, but restricts the vocabulary to what is listed in 30k.vocab
(one word per line). Other words are replaced with the <unk> token.
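
For illustration, a vocabulary file is just a word list; a
hypothetical 30k.vocab might begin like this:

    a
    and
    of
    the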

    ngram-count -read corpus.ngrams -vocab 20k.vocab -write-order 3 \
        -write corpus.20k.3grams

Reads the counts from corpus.ngrams, maps them to the indicated
vocabulary, and writes only the trigrams back to a file.

    ngram-count -text corpus -write1 corpus.1grams -write2 corpus.2grams \
        -write3 corpus.3grams

Writes unigrams, bigrams, and trigrams to separate files, in one
pass. Usually there is no reason to keep these separate, which is
why by default n-grams of all lengths are written out together.
The -recompute flag regenerates the lower-order counts from the
highest-order counts by summing over n-gram prefixes.
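
For instance (the file names here are hypothetical), lower-order
counts could be rebuilt from a file containing only 4-gram counts:

    ngram-count -order 4 -read corpus.4grams -recompute -write corpus.all.ngrams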

The -read and -text options are additive, and can be used to merge
new counts with old. Furthermore, repeated n-grams read with -read
are also additive. Thus,

    cat counts1 counts2 ... | ngram-count -read -

will merge the counts from counts1, counts2, ...

All file reading and writing uses the zio routines, so the argument "-"
stands for stdin/stdout, and .Z and .gz files are handled correctly.
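
Thus, for example, counts can be written in compressed form directly,
simply by naming a .gz file (the file name is hypothetical):

    ngram-count -text corpus -write corpus.ngrams.gz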

For very large count files (due to a large corpus or vocabulary) this
method of merging counts in memory is not suitable. Alternatively,
counts can be sorted and then merged.
First, generate counts for portions of the training corpus and
save them with -sort:

    ngram-count -text part1.text -sort -write part1.ngrams.Z
    ngram-count -text part2.text -sort -write part2.ngrams.Z

Then combine these with

    ngram-merge part?.ngrams.Z > all.ngrams.Z

(Although ngram-merge can deal with any number of files >= 2,
it is most efficient to combine two parts at a time, then combine
the resulting count files, again two at a time, following a binary
tree merging scheme, as sketched below.)
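
A sketch of that pairwise scheme for four count files (the file names
are hypothetical):

    ngram-merge part1.ngrams.Z part2.ngrams.Z > parts12.ngrams.Z
    ngram-merge part3.ngrams.Z part4.ngrams.Z > parts34.ngrams.Z
    ngram-merge parts12.ngrams.Z parts34.ngrams.Z > all.ngrams.Z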

BACKOFF MODEL ESTIMATION

    ngram-count -order 2 -read corpus.counts -lm corpus.bo

Generates a bigram backoff model in ARPA (aka Doug Paul) format
from corpus.counts and writes it to corpus.bo.

If the counts fit into memory (and hence there is no reason for the
merging schemes described above) it is more convenient and faster
to go directly from training text to model:

    ngram-count -text corpus -lm corpus.bo

The built-in discounting method used in building backoff models
is Good-Turing. The lower exclusion cutoffs can be set with the
options -gt1min ... -gt6min; the upper discounting cutoffs are
selected with -gt1max ... -gt6max. Reasonable defaults are
provided; they are displayed as part of the -help output.
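
For example, to set the trigram cutoffs explicitly (the values here
are arbitrary):

    ngram-count -text corpus -gt3min 2 -gt3max 5 -lm corpus.bo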

When using limited vocabularies it is recommended to compute the
discount coefficients on the unlimited vocabulary (at least for
the unigrams) and then apply them to the limited vocabulary.
(Otherwise the vocabulary truncation would produce badly skewed
count frequencies at the low end, which would break the GT algorithm.)
For this reason, discounting parameters can be saved to files and
read back in.
For example,

    ngram-count -text corpus -gt1 gt1.params -gt2 gt2.params \
        -gt3 gt3.params

saves the discounting parameters for unigrams, bigrams, and trigrams
in the files as indicated. These are short files that can be
edited, e.g., to adjust the lower and upper discounting cutoffs.
Then the limited vocabulary backoff model is estimated using these
saved parameters:

    ngram-count -text corpus -vocab 20k.vocab \
        -gt1 gt1.params -gt2 gt2.params -gt3 gt3.params -lm corpus.20k.bo

MODEL EVALUATION

The ngram program uses a backoff model to compute probabilities and
perplexity on test data.

    ngram -lm some.bo -ppl test.corpus

Computes the perplexity on test.corpus according to the model some.bo.
The -debug flag controls the amount of information output:

    -debug 0    only overall statistics
    -debug 1    statistics for each test sentence
    -debug 2    probabilities for each word
    -debug 3    verify that word probabilities over the
                entire vocabulary sum to 1 for each context
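
For example, to print per-word probabilities during a perplexity run:

    ngram -lm some.bo -ppl test.corpus -debug 2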

ngram also understands the -order flag to set the maximum n-gram
order effectively used by the model. The default is 3.
It has to be explicitly set to a higher value to use n-grams of
higher order, even if the file specified with -lm contains
higher-order n-grams.
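
For example, to make full use of a 4-gram model (some.4g.bo is a
hypothetical model file trained with -order 4):

    ngram -lm some.4g.bo -order 4 -ppl test.corpus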

The flag -skipoovs establishes compatibility with broken behavior
in some old software. It should only be used with bo model files
produced with the old tools. It will

- let OOVs be counted as such even when the model has a probability
  for <unk>
- skip not just the OOV but the entire n-gram context in which any
  OOVs occur (instead of backing off on OOV contexts).
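
It is used like any other ngram flag, e.g. (old.bo stands for a model
file produced with the old tools):

    ngram -lm old.bo -skipoovs -ppl test.corpus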

OTHER MODEL OPERATIONS

ngram performs a few other operations on backoff models.

    ngram -lm bo1 -mix-lm bo2 -lambda 0.2 -write-lm bo3

Produces a new model in bo3 that is the interpolation of bo1 and bo2,
with a weight of 0.2 for bo1 (and hence 0.8 for bo2).

    ngram -lm bo -renorm -write-lm bo.new

Recomputes the backoff weights in the model bo (thus normalizing
probabilities to 1) and leaves the result in bo.new.

API FOR LANGUAGE MODELS

These programs are just examples of how to use the object-oriented
language model library currently under construction. To use the API
one has to read the various .h files and study how the interfaces
are used in the example programs. No comprehensive documentation is
available as yet. Sorry.

AVAILABILITY

This code is Copyright SRI International, but is available free of
charge for non-profit use. See the License file in the top-level
directory for the terms of use.

Andreas Stolcke
$Date: 1999/07/31 18:48:33 $