<! $Id: lm-scripts.1,v 1.9 2019/09/09 22:35:36 stolcke Exp $>
<HTML>
<HEADER>
<TITLE>lm-scripts</TITLE>
<BODY>
<H1>lm-scripts</H1>
<H2> NAME </H2>
lm-scripts, add-dummy-bows, change-lm-vocab, empty-sentence-lm, get-unigram-probs, make-hiddens-lm, make-lm-subset, make-sub-lm, remove-lowprob-ngrams, reverse-lm, sort-lm - manipulate N-gram language models
<H2> SYNOPSIS </H2>
<PRE>
<B>add-dummy-bows</B> [ <I>lm-file</I> ] <B>&gt;</B> <I>new-lm-file</I>
<B>change-lm-vocab</B> <B>-vocab</B> <I>vocab</I> <B>-lm</B> <I>lm-file</I> <B>-write-lm</B> <I>new-lm-file</I> \
	[ <B>-tolower</B> ] [ <B>-subset</B> ] [ <I>ngram-options</I> ... ]
<B>empty-sentence-lm</B> <B>-prob</B> <I>p</I> <B>-lm</B> <I>lm-file</I> <B>-write-lm</B> <I>new-lm-file</I> \
	[ <I>ngram-options</I> ... ]
<B>get-unigram-probs</B> [ <B>linear=1</B> ] [ <I>lm-file</I> ]
<B>make-hiddens-lm</B> [ <I>lm-file</I> ] <B>&gt;</B> <I>hiddens-lm-file</I>
<B>make-lm-subset</B> <I>count-file</I>|<B>-</B> [ <I>lm-file</I> |<B>-</B> ] <B>&gt;</B> <I>new-lm-file</I>
<B>make-sub-lm</B> [ <B>maxorder=</B><I>N</I> ] [ <I>lm-file</I> ] <B>&gt;</B> <I>new-lm-file</I>
<B>remove-lowprob-ngrams</B> [ <I>lm-file</I> ] <B>&gt;</B> <I>new-lm-file</I>
<B>reverse-lm</B> [ <I>lm-file</I> ] <B>&gt;</B> <I>new-lm-file</I>
<B>sort-lm</B> [ <I>lm-file</I> ] <B>&gt;</B> <I>sorted-lm-file</I>
</PRE>
<H2> DESCRIPTION </H2>
These scripts perform various useful manipulations on N-gram models
in their textual representation.
Most operate on backoff N-gram models in the ARPA format described in
<A HREF="ngram-format.5.html">ngram-format(5)</A>.
<P>
Since these tools are implemented as scripts, they do not automatically
read or write compressed model files, unlike the main
SRILM tools.
However, since most scripts read from standard input or write
to standard output (by leaving out the file argument, or specifying it
as ``-''), it is easy to combine them with
<A HREF="gunzip.1.html">gunzip(1)</A>
or
<A HREF="gzip.1.html">gzip(1)</A>
on the command line.
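<P>
A minimal sketch of such a pipeline follows.
The helper name lm_filter_gz and the file names in the usage comment are
hypothetical and not part of SRILM; the sketch only assumes that the chosen
script reads standard input and writes standard output, as described above.

```shell
# lm_filter_gz IN.gz OUT.gz COMMAND [ARGS...]
# Hypothetical wrapper: decompress IN.gz, pass the text through COMMAND,
# and write the recompressed result to OUT.gz.
lm_filter_gz() {
    in=$1; out=$2; shift 2
    gunzip -c "$in" | "$@" | gzip > "$out"
}

# Intended use with one of the scripts (file names are placeholders):
#   lm_filter_gz lm.gz sorted-lm.gz sort-lm
```

Any script that accepts its model on standard input can be substituted for
the command argument.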
<P>
Also note that many of the scripts take their options with the
<A HREF="gawk.1.html">gawk(1)</A>
syntax
<I>option</I><B>=</B><I>value</I>
instead of the more common
<B>-</B><I>option</I>
<I>value</I>.
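<P>
To illustrate the convention, the sketch below uses plain awk rather than
any SRILM script.
The helper keep_up_to_order and its input format (an order followed by words)
are made up for the example; only the variable name maxorder is borrowed from
the make-sub-lm option.
An operand of the form name=value on an awk command line is treated as a
variable assignment, not as a file name.

```shell
# Illustration only: plain awk, not an SRILM script.
# keep_up_to_order is a hypothetical helper; its input format is invented.
keep_up_to_order() {
    # "maxorder=$1" is an awk variable assignment, not a file name;
    # awk applies it before it starts reading standard input ("-").
    awk '$1 <= maxorder { print }' maxorder="$1" -
}

# Keep only the lines whose first field is at most 2.
printf '1 a\n2 a b\n3 a b c\n' | keep_up_to_order 2
```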
<P>
<B> add-dummy-bows </B>
adds dummy backoff weights to N-grams, even where they
are not required, to satisfy some broken software that expects
backoff weights on all N-grams (except those of highest order).
<P>
<B> change-lm-vocab </B>
modifies the vocabulary of an LM to be that in
<I>vocab</I>.
Any N-grams containing out-of-vocabulary words are removed,
new words receive a unigram probability, and the model
is renormalized.
The
<B> -tolower </B>
option causes case distinctions to be ignored.
<B> -subset </B>
only removes words from the LM vocabulary, without adding any.
Any remaining
<I> ngram-options </I>
are passed to
<A HREF="ngram.1.html">ngram(1)</A>,
and can be used to set debugging level, N-gram order, etc.
<P>
<B> empty-sentence-lm </B>
modifies an LM so that it allows the empty sentence with
probability
<I>p</I>.
This is useful for modifying existing LMs that were trained on non-empty
sentences only.
<I> ngram-options </I>
are passed to
<A HREF="ngram.1.html">ngram(1)</A>,
and can be used to set debugging level, N-gram order, etc.
<P>
<B> make-hiddens-lm </B>
constructs an N-gram model that can be used with the
<B> ngram -hiddens </B>
option.
The new model contains intra-utterance sentence boundary
tags ``&lt;#s&gt;'' with the same probability as the original model
had final sentence tags &lt;/s&gt;.
Also, utterance-initial words are not conditioned on &lt;s&gt; and
there is no penalty associated with utterance-final &lt;/s&gt;.
Such a model might work better if the test corpus is segmented
at places other than proper &lt;s&gt; boundaries.
<P>
<B> make-lm-subset </B>
forms a new LM containing only the N-grams found in the
<I>count-file</I>,
in
<A HREF="ngram-count.1.html">ngram-count(1)</A>
format.
The result still needs to be renormalized with
<B> ngram -renorm </B>
(which will also adjust the N-gram counts in the header).
<P>
<B> make-sub-lm </B>
removes N-grams of order exceeding
<I>N</I>.
This function is now redundant, since
all SRILM tools can do this implicitly (without extra memory
and with very little time overhead) when reading N-gram models
with the appropriate
<B> -order </B>
parameter.
<P>
<B> remove-lowprob-ngrams </B>
eliminates N-grams whose probability is lower than that which they
would receive through backoff.
This is useful when building finite-state networks for N-gram
models.
However, this function is now performed much faster by
<A HREF="ngram.1.html">ngram(1)</A>
with the
<B> -prune-lowprobs </B>
option.
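<P>
The criterion can be made concrete for a single bigram.
In a backoff model stored with log10 values, a bigram ``a b'' without an
explicit entry would receive the log probability bow(a) + log p(b), so an
explicit entry below that value is a candidate for removal.
The sketch below only evaluates this comparison; the helper name
lowprob_bigram and the numbers are invented for illustration, and it is not
the script's actual implementation.

```shell
# lowprob_bigram LOGP_EXPLICIT BOW_CONTEXT LOGP_UNIGRAM
# Hypothetical helper; all three arguments are log10 values as in an ARPA file.
lowprob_bigram() {
    awk -v p="$1" -v bow="$2" -v uni="$3" 'BEGIN {
        backoff = bow + uni      # log10 p_bo(b | a) = bow(a) + log10 p(b)
        print (p < backoff) ? "remove" : "keep"
    }'
}

lowprob_bigram -2.5 -0.3 -1.7   # explicit estimate is below the backoff estimate
lowprob_bigram -1.2 -0.3 -1.7   # explicit estimate is above the backoff estimate
```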
<P>
<B> reverse-lm </B>
produces a new LM that generates sentences with probabilities equal
to those of the reversed sentences in the input model.
<P>
<B> sort-lm </B>
sorts the N-grams in an LM in lexicographic order (left-most words being
the most significant).
This is not a requirement for SRILM, but might be necessary for some
other LM software.
(The LMs output by SRILM are sorted somewhat differently, reflecting
the internal data structures used; that is also the order that should give
the best cache utilization when using SRILM to read models.)
<P>
<B> get-unigram-probs </B>
extracts the unigram probabilities in a simple table format
from a backoff language model.
The
<B> linear=1 </B>
option causes probabilities to be output on a linear (instead of log) scale.
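<P>
As a rough sketch of the idea (not the script itself), the following awk
filter prints each word in the unigram section of an ARPA file together with
its probability.
The function name unigram_probs is hypothetical, and the output layout of the
real
<B> get-unigram-probs </B>
may differ; the sketch assumes the usual ARPA layout of a log10 probability
followed by the word.

```shell
# unigram_probs [LINEAR]
# Hypothetical helper: reads an ARPA model on standard input and prints
# "word probability" for every unigram; LINEAR=1 converts log10 to linear.
unigram_probs() {
    awk '
        /^\\1-grams:/ { in_unigrams = 1; next }
        /^\\/         { in_unigrams = 0; next }   # any other section marker
        in_unigrams && NF >= 2 {
            p = linear ? exp($1 * log(10)) : $1   # 10^logprob when linear is set
            print $2, p
        }
    ' linear="${1:-0}" -
}

# Tiny made-up model used only to exercise the function.
unigram_probs 1 <<'EOF'
\data\
ngram 1=2

\1-grams:
-1 a -0.5
-2 b

\end\
EOF
```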
<H2> SEE ALSO </H2>
<A HREF="ngram-format.5.html">ngram-format(5)</A>, <A HREF="ngram.1.html">ngram(1)</A>.
<H2> BUGS </H2>
These are quick-and-dirty scripts, what do you expect?
<BR>
<B> reverse-lm </B>
supports only bigram LMs, and can produce improper probability estimates
as a result of inconsistent marginals in the input model.
<H2> AUTHOR </H2>
Andreas Stolcke &lt;stolcke@icsi.berkeley.edu&gt;
<BR>
Copyright (c) 1995-2006 SRI International
</BODY>
</HTML>
