SRI Object-oriented Language Modeling Libraries and Tools

BUILD AND INSTALL

See the INSTALL file in the top-level directory.

FILES

All mixed-case .cc and .h files contain C++ code for the liboolm
library.
All lower-case .cc files in this directory correspond to executable commands.
Each program is option-driven.  The -help option prints a list of
available options and their meanings.

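For example, to list each tool's options (ngram-count is used here, but
any of the commands described below works the same way):

	ngram-count -help
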
N-GRAM MODELS

Here we only cover arbitrary-order n-grams (no classes yet, sorry).

	ngram-count		manipulates n-gram counts and
				estimates backoff models
	ngram-merge		merges count files (only needed for large
				corpora/vocabularies)
	ngram			computes probabilities given a backoff model
				(also does mixing of backoff models)

Below are some typical command lines.

NGRAM COUNT MANIPULATION

       ngram-count -order 4 -text corpus -write corpus.ngrams

   Counts all n-grams up to length 4 in corpus and writes them to
   corpus.ngrams.  The default n-gram order (if not specified) is 3.

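   As a minimal end-to-end illustration (the toy corpus contents and
   file names here are made up for this sketch):

	# hypothetical one-sentence training file
	echo "the cat sat on the mat" > toy.corpus
	ngram-count -order 4 -text toy.corpus -write toy.ngrams

   Each line of the resulting toy.ngrams pairs an n-gram with its count.
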
       ngram-count -vocab 30k.vocab -order 4 -text corpus -write corpus.ngrams

   Same, but restricts the vocabulary to what is listed in 30k.vocab
   (one word per line).  Other words are replaced with the <unk> token.

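   One way to produce such a word list with standard Unix tools (a
   sketch; the 30000-word cutoff is arbitrary):

	# keep the 30000 most frequent words, one word per line
	tr -s '[:space:]' '\n' < corpus | sort | uniq -c | sort -rn | \
		head -n 30000 | awk '{print $2}' > 30k.vocab
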
       ngram-count -read corpus.ngrams -vocab 20k.vocab -write-order 3 \
			       -write corpus.20k.3grams

    Reads the counts from corpus.ngrams, maps them to the indicated
    vocabulary, and writes only the trigrams back to a file.

	ngram-count -text corpus -write1 corpus.1grams -write2 corpus.2grams \
				       -write3 corpus.3grams

    Writes unigrams, bigrams, and trigrams to separate files, in one
    pass.  Usually there is no reason to keep these separate, which is
    why by default n-grams of all lengths are written out together.
    The -recompute flag will regenerate the lower-order counts from the
    highest-order ones by summation over prefixes.

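    A sketch of -recompute usage (assuming corpus.ngrams holds the
    4-gram counts written above; the output file name is made up):

	# rebuild 1-, 2-, and 3-gram counts from the 4-gram counts
	ngram-count -order 4 -read corpus.ngrams -recompute \
		-write corpus.recomputed.ngrams
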
    The -read and -text options are additive, and can be used to merge
    new counts with old.  Furthermore, repeated n-grams read with -read
    are also additive.  Thus,

	cat counts1 counts2 ... | ngram-count -read -

    will merge the counts from counts1, counts2, ...

    All file reading and writing uses the zio routines, so argument "-"
    stands for stdin/stdout, and .Z and .gz files are handled correctly.

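    For instance, compressed text can be read from stdin and compressed
    counts written out directly (a sketch assuming a gzip-compressed
    training file corpus.gz):

	# "-" reads the uncompressed stream; the .gz output is compressed
	gzip -dc corpus.gz | ngram-count -order 3 -text - -write corpus.ngrams.gz
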
    For very large count files (due to a large corpus/vocabulary) this
    method of merging counts in-memory is not suitable.  Alternatively,
    counts can be sorted and then merged.
    First, generate counts for portions of the training corpus and
    save them with -sort:

	ngram-count -text part1.text -sort -write part1.ngrams.Z
	ngram-count -text part2.text -sort -write part2.ngrams.Z

    Then combine these with

	ngram-merge part?.ngrams.Z > all.ngrams.Z

    (Although ngram-merge can deal with any number of files >= 2,
    it is most efficient to combine two parts at a time, then the
    resulting count files, again two at a time, in a binary tree
    merging scheme, as sketched below.)

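    For example, with four count files the binary tree scheme looks
    like this (the intermediate file names are made up):

	ngram-merge part1.ngrams.Z part2.ngrams.Z > merge12.ngrams.Z
	ngram-merge part3.ngrams.Z part4.ngrams.Z > merge34.ngrams.Z
	ngram-merge merge12.ngrams.Z merge34.ngrams.Z > all.ngrams.Z
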
BACKOFF MODEL ESTIMATION

        ngram-count -order 2 -read corpus.counts -lm corpus.bo

    generates a bigram backoff model in ARPA (aka Doug Paul) format
    from corpus.counts and writes it to corpus.bo.

    If counts fit into memory (and hence there is no reason for the
    merging schemes described above) it is more convenient and faster
    to go directly from training text to model:

	ngram-count -text corpus -lm corpus.bo

    The built-in discounting method used in building backoff models
    is Good-Turing.  The lower exclusion cutoffs can be set with the
    options -gt1min ... -gt6min; the upper discounting cutoffs are
    selected with -gt1max ... -gt6max.  Reasonable defaults are
    provided, and they are displayed as part of the -help output.

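    For example (the cutoff values here are purely illustrative):

	ngram-count -text corpus -gt2min 2 -gt3min 3 -lm corpus.bo
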
    When using limited vocabularies it is recommended to compute the
    discounting coefficients on the unlimited vocabulary (at least for
    the unigrams) and then apply them to the limited vocabulary
    (otherwise the vocabulary truncation would produce badly skewed
    count frequencies at the low end that would break the GT algorithm).
    For this reason, discounting parameters can be saved to files and
    read back in.
    For example,

	ngram-count -text corpus -gt1 gt1.params -gt2 gt2.params \
					-gt3 gt3.params

    saves the discounting parameters for unigrams, bigrams, and trigrams
    in the files as indicated.  These are short files that can be
    edited, e.g., to adjust the lower and upper discounting cutoffs.
    Then the limited-vocabulary backoff model is estimated using these
    saved parameters:

	ngram-count -text corpus -vocab 20k.vocab \
		-gt1 gt1.params -gt2 gt2.params -gt3 gt3.params -lm corpus.20k.bo

MODEL EVALUATION

    The ngram program uses a backoff model to compute probabilities and
    perplexity on test data.

	ngram -lm some.bo -ppl test.corpus

    computes the perplexity on test.corpus according to model some.bo.
    The flag -debug controls the amount of information output:

		-debug 0	only overall statistics
		-debug 1	statistics for each test sentence
		-debug 2	probabilities for each word
		-debug 3	verify that word probabilities over the
				entire vocabulary sum to 1 for each context

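    For example, to see the probability assigned to each word:

	ngram -lm some.bo -ppl test.corpus -debug 2
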
    ngram also understands the -order flag to set the maximum n-gram
    order effectively used by the model.  The default is 3.
    It has to be set explicitly to use n-grams of higher order, even
    if the file specified with -lm contains higher-order n-grams.

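    For example, to make use of the 4-grams in a model file (some.4g.bo
    is a made-up name for this sketch):

	ngram -order 4 -lm some.4g.bo -ppl test.corpus
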
    The flag -skipoovs establishes compatibility with broken behavior
    in some old software.  It should only be used with backoff model
    files produced by the old tools.  It will

     - let OOVs be counted as such even when the model has a probability
       for <unk>
     - skip not just the OOV but the entire n-gram context in which any
       OOVs occur (instead of backing off on OOV contexts).

OTHER MODEL OPERATIONS

    ngram performs a few other operations on backoff models.

	ngram -lm bo1 -mix-lm bo2 -lambda 0.2 -write-lm bo3

    produces a new model in bo3 that is the interpolation of bo1 and bo2
    with a weight of 0.2 (for bo1).

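    The mixture can also be evaluated on the fly, e.g., to tune the
    weight on held-out data before writing a merged model (a sketch;
    dev.corpus stands in for a held-out data set):

	# perplexity of the interpolated model, without writing it out
	ngram -lm bo1 -mix-lm bo2 -lambda 0.2 -ppl dev.corpus
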
	ngram -lm bo -renorm -write-lm bo.new

    recomputes the backoff weights in the model bo (thus normalizing
    probabilities to 1) and leaves the result in bo.new.

API FOR LANGUAGE MODELS

    These programs are just examples of how to use the object-oriented
    language model library currently under construction.  To use the API
    one has to read the various .h files and study how the interfaces
    are used in the example programs.  No comprehensive documentation is
    available as yet.  Sorry.

AVAILABILITY

    This code is Copyright SRI International, but is available free of
    charge for non-profit use.  See the License file in the top-level
    directory for the terms of use.

Andreas Stolcke
$Date: 1999/07/31 18:48:33 $