| <! $Id: ppl-scripts.1,v 1.9 2019/09/09 22:35:37 stolcke Exp $>
 | |
| <HTML>
 | |
| <HEADER>
 | |
| <TITLE>ppl-scripts</TITLE>
 | |
| <BODY>
 | |
| <H1>ppl-scripts</H1>
 | |
| <H2> NAME </H2>
 | |
| ppl-scripts, add-ppls, compare-ppls, compute-best-mix, compute-best-sentence-mix, filter-event-counts, hits-from-log, ppl-from-log, subtract-ppls - manipulate perplexities
 | |
| <H2> SYNOPSIS </H2>
 | |
| <PRE>
 | |
| <B>add-ppls</B> [ <I>ppl-file</I> ... ]
 | |
| <B>subtract-ppls</B> <I>ppl-file1</I> [ <I>ppl-file2</I> ... ]
 | |
| <B>ppl-from-log</B> [ <I>ppl-file</I> ... ]
 | |
| <B>hits-from-log</B> [ <I>ppl-file</I> ... ]
 | |
| <B>compare-ppls</B> [ <B>mindelta=</B><I>D</I> ] <I>ppl-file1</I> <I>ppl-file2</I>
 | |
| <B>compute-best-mix</B> [ <B>lambda='</B><I>l1 l2</I> ...<B>'</B> ] [ <B>precision=</B><I>P</I> ] \
 | |
| 	<I>ppl-file1</I> [ <I>ppl-file2</I> ... ]
 | |
| <B>compute-best-sentence-mix</B> [ <B>lambda='</B><I>l1 l2</I> ...<B>'</B> ] [ <B>precision=</B><I>P</I> ] \
 | |
| 	[ <B>addone=</B><I>c</I> ] <I>ppl-file1</I> [ <I>ppl-file2</I> ... ]
 | |
| <B>filter-event-counts</B> [ <B>order=</B><I>N</I> ] [ <B>escape=</B><I>string</I> ] [ <I>counts</I> ... ]
 | |
| </PRE>
 | |
| <H2> DESCRIPTION </H2>
 | |
| These scripts process the output of the 
 | |
| <A HREF="ngram.1.html">ngram(1)</A>
 | |
| option
 | |
| <B> -ppl </B>
 | |
| to extract various useful information.
 | |
| They are particularly convenient for analyzing the performance (perplexity) of 
 | |
| language models on specific subsets of the test data,
 | |
| or for comparing and combining multiple models.
 | |
| <P>
 | |
| <B> add-ppls </B>
 | |
| takes several ppl output files and computes an aggregate perplexity and
 | |
| corpus statistics.
 | |
| Its output is suitable for subsequent manipulation by
 | |
| <B> add-ppls </B>
 | |
| or
 | |
| <B>subtract-ppls</B>.
 | |
| <P>
 | |
| <B> subtract-ppls </B>
 | |
| similarly computes an aggregate perplexity by removing the
 | |
| statistics of zero or more
 | |
| <I> ppl-file2 </I>
 | |
| from those in
 | |
| <I>ppl-file1</I>.
 | |
| Its output is suitable for subsequent manipulation by
 | |
| <B> add-ppls </B>
 | |
| or
 | |
| <B>subtract-ppls</B>.
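<P>
As an illustration, the aggregation done by <B>add-ppls</B> and <B>subtract-ppls</B> can be sketched in Python as follows.
This is a minimal sketch, not the scripts' actual implementation: the <I>PplStats</I> class and its field names are hypothetical, and the perplexity formulas assume that the total log10 probability is normalized by the number of scored tokens (words minus OOVs and zeroprobs, plus one end-of-sentence token per sentence for ppl).

```python
from dataclasses import dataclass


@dataclass
class PplStats:
    """Hypothetical container for the totals reported in one ppl file."""
    sentences: int
    words: int
    oovs: int
    zeroprobs: int
    logprob: float  # total log10 probability of the scored tokens

    def __add__(self, other):
        # Aggregating corpora: all counts and log probabilities simply add up.
        return PplStats(self.sentences + other.sentences,
                        self.words + other.words,
                        self.oovs + other.oovs,
                        self.zeroprobs + other.zeroprobs,
                        self.logprob + other.logprob)

    def __sub__(self, other):
        # Removing a subset: subtract its statistics from the totals.
        return PplStats(self.sentences - other.sentences,
                        self.words - other.words,
                        self.oovs - other.oovs,
                        self.zeroprobs - other.zeroprobs,
                        self.logprob - other.logprob)

    def ppl(self):
        # Assumed convention: end-of-sentence tokens count as scored events.
        denom = self.words - self.oovs - self.zeroprobs + self.sentences
        return 10 ** (-self.logprob / denom)

    def ppl1(self):
        # Assumed convention: same as ppl, but without end-of-sentence tokens.
        denom = self.words - self.oovs - self.zeroprobs
        return 10 ** (-self.logprob / denom)
```

Because perplexity is not additive, the sketch combines the raw counts and log probabilities first and derives the perplexity only from the combined totals.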
 | |
| <P>
 | |
| <B> ppl-from-log </B>
 | |
| recomputes the total perplexities and statistics from individual
 | |
| lines in
 | |
| <B> ngram -debug 2 -ppl </B>
 | |
| output.
 | |
| Combined with some filtering of that output, this allows computing 
 | |
| perplexities on interesting subsets of words.
 | |
| <P>
 | |
| <B> hits-from-log </B>
 | |
| computes N-gram hit rates from
 | |
| <B> ngram -debug 2 -ppl </B>
 | |
| output.
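<P>
The idea behind the hit-rate computation can be sketched in Python as follows.
This is only an illustration under a stated assumption: each word-level line of the debug output is taken to carry a tag such as [2gram] after the equals sign, naming the order of the N-gram that supplied the probability.
The function names are hypothetical and the real script may report its results differently.

```python
import re
from collections import Counter

# Assumed line shape: "p( word | context ) = [2gram] 0.05 [ -1.30103 ]"
HIT_TAG = re.compile(r"=\s*\[(\d+)gram\]")


def ngram_hits(lines):
    """Count how often each N-gram order supplied a word probability."""
    hits = Counter()
    for line in lines:
        match = HIT_TAG.search(line)
        if match:  # lines without a hit tag (summaries, text) are skipped
            hits[int(match.group(1))] += 1
    return hits


def hit_rates(hits):
    """Convert raw hit counts into fractions of all tagged words."""
    total = sum(hits.values())
    if total == 0:
        return {}
    return {order: count / total for order, count in sorted(hits.items())}
```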
 | |
| <P>
 | |
| <B> compare-ppls </B>
 | |
| tallies the number of words for which two language models produce the same,
 | |
| higher, or lower probabilities.
 | |
| The input files should be 
 | |
| <B> ngram -debug 2 -ppl </B>
 | |
| output for the two models on the same test set.
 | |
| The parameter
 | |
| <I> D </I>
 | |
| is the minimum absolute difference for two log probabilities to be 
 | |
| considered different (the default is 0).
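<P>
The tally can be sketched in Python as follows, assuming the per-word log probabilities of both models have already been extracted into two aligned lists.
The function name, the result keys, and the choice to report "higher" and "lower" from the second model's point of view are assumptions made for this illustration, as is the strict comparison against <I>D</I>.

```python
def compare_logprobs(logprobs1, logprobs2, mindelta=0.0):
    """Tally words where model 2 scores the same, higher, or lower than model 1."""
    if len(logprobs1) != len(logprobs2):
        raise ValueError("both models must be scored on the same test set")
    tally = {"same": 0, "higher": 0, "lower": 0}
    for lp1, lp2 in zip(logprobs1, logprobs2):
        diff = lp2 - lp1
        if diff > mindelta:
            tally["higher"] += 1
        elif diff < -mindelta:
            tally["lower"] += 1
        else:
            # Differences no larger than mindelta count as equal (assumed).
            tally["same"] += 1
    return tally
```

With the default of 0, only exactly equal log probabilities are counted as the same; a larger <I>D</I> absorbs small numerical differences between the two models.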
 | |
| <P>
 | |
| <B> compute-best-mix </B>
 | |
| takes the output of several
 | |
| <B> ngram -debug 2 -ppl </B>
 | |
| runs on the same test set and computes the optimal interpolation 
 | |
| weights for the corresponding models,
 | |
| i.e., the weights that minimize the perplexity of an interpolated model.
 | |
| Initial weights may be specified with the <B>lambda=</B> option as
 | |
| <I>l1 l2 ...</I>.
 | |
| The computation is iterative and stops when the interpolation weights
 | |
| change by less than
 | |
| <I> P </I>
 | |
| (default 0.001).
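<P>
One standard way to find such weights is expectation-maximization (EM), sketched below in Python.
This is an illustration under assumptions, not a transcription of the script: the input is assumed to be a list of rows, one per word, each holding that word's probability (not log probability) under every model, and the function name and the <I>max_iter</I> safeguard are hypothetical.

```python
def best_mix(probs, lambdas=None, precision=0.001, max_iter=1000):
    """Estimate word-level interpolation weights with EM."""
    n_models = len(probs[0])
    lam = list(lambdas) if lambdas else [1.0 / n_models] * n_models
    for _ in range(max_iter):
        # E-step: accumulate each model's posterior share of every word.
        posterior_totals = [0.0] * n_models
        for row in probs:
            mix = sum(l * p for l, p in zip(lam, row))
            if mix == 0.0:
                continue  # no model explains this word; skip it
            for m in range(n_models):
                posterior_totals[m] += lam[m] * row[m] / mix
        total = sum(posterior_totals)
        if total == 0.0:
            break
        # M-step: the new weights are the normalized posterior totals.
        new_lam = [t / total for t in posterior_totals]
        change = max(abs(a - b) for a, b in zip(new_lam, lam))
        lam = new_lam
        if change < precision:  # stopping rule described above
            break
    return lam
```

Each iteration cannot increase the perplexity of the interpolated model on the given data, which is why the change in the weights is a reasonable stopping criterion.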
 | |
| <P>
 | |
| <B> compute-best-sentence-mix </B>
 | |
| similarly optimizes the weights for sentence-level interpolation of LMs.
 | |
| It requires input files generated by
 | |
| <B>ngram -debug 1 -ppl</B>.
 | |
| (Sentence-level mixtures can be implemented using the 
 | |
| <B> ngram -hmm </B>
 | |
| option, by constructing a suitable HMM structure.)
 | |
| The 
 | |
| <B>addone=</B><I>c</I>
 | |
| option performs Laplace smoothing by adding 
 | |
| <I> c </I>
 | |
| to the estimated posterior counts for each model.
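<P>
A sentence-level variant of the estimation, including the smoothing step, might look as follows in Python.
This is a sketch under assumptions: the input is taken to be one row per sentence holding the total log10 probability of that sentence under each model, the estimation is done with EM, and the function and parameter names are hypothetical.

```python
def best_sentence_mix(logprobs, lambdas=None, precision=0.001,
                      addone=0.0, max_iter=1000):
    """Estimate sentence-level mixture weights with EM and add-c smoothing."""
    n_models = len(logprobs[0])
    lam = list(lambdas) if lambdas else [1.0 / n_models] * n_models
    for _ in range(max_iter):
        # Start every model's posterior count at c (Laplace-style smoothing).
        counts = [float(addone)] * n_models
        for row in logprobs:
            # Rescale by the best log probability to avoid underflow.
            top = max(row)
            scaled = [l * 10 ** (lp - top) for l, lp in zip(lam, row)]
            norm = sum(scaled)
            if norm == 0.0:
                continue
            for m in range(n_models):
                counts[m] += scaled[m] / norm
        total = sum(counts)
        if total == 0.0:
            break
        new_lam = [c / total for c in counts]
        change = max(abs(a - b) for a, b in zip(new_lam, lam))
        lam = new_lam
        if change < precision:
            break
    return lam
```

Without smoothing, a model that loses on every sentence is driven toward weight zero; adding <I>c</I> keeps each weight away from zero when there are few sentences.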
 | |
| <P>
 | |
| <B> filter-event-counts </B>
 | |
| prepares a count file for perplexity computation.
 | |
| It removes counts that do not correspond to word-prediction events for the LM.
 | |
| The 
 | |
| <B>order=</B><I>N</I>
 | |
| option specifies the maximal N-gram order to use.
 | |
| The effect of filtering is such that
 | |
| <PRE>
 | |
| 	ngram -order <I>N</I> -lm <I>LM</I> -ppl <I>TEXT</I>
 | |
| </PRE>
 | |
| and
 | |
| <PRE>
 | |
| 	ngram-count -order <I>N</I> -text <I>TEXT</I> -write - | \
 | |
| 	filter-event-counts order=<I>N</I> | \
 | |
| 	ngram -order <I>N</I> -lm <I>LM</I> -counts -
 | |
| </PRE>
 | |
| yield the same result.
 | |
| The 
 | |
| <B> escape= </B>
 | |
| option specifies a string that causes all input lines beginning with 
 | |
| that string to be passed through
 | |
| (useful in combination with
 | |
| <B>ngram -escape</B>).
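<P>
The filtering rule implied by the equivalence above can be sketched in Python as follows.
This is an inference from the description, not the script's actual code: it assumes that each count line holds the N-gram words followed by the count, that an event is an N-gram whose last word is predicted from the longest available history, and that sentences start with the tag &lt;s&gt;.
The function name is hypothetical.

```python
def filter_event_counts(lines, order=3, escape=""):
    """Yield only the count lines that correspond to LM prediction events."""
    for line in lines:
        if escape and line.startswith(escape):
            yield line  # escaped lines are passed through unchanged
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # not a well-formed "ngram count" line
        words = fields[:-1]  # the last field is the count
        if words[-1] == "<s>":
            continue  # the sentence-start tag is never predicted
        if len(words) == order:
            yield line  # full-length history
        elif 2 <= len(words) < order and words[0] == "<s>":
            yield line  # shorter history at the start of a sentence
```

For a one-sentence text "a b" and order=3, this keeps the counts for "&lt;s&gt; a", "&lt;s&gt; a b", and "a b &lt;/s&gt;", that is, one event per predicted token.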
 | |
| <H2> SEE ALSO </H2>
 | |
| <A HREF="ngram.1.html">ngram(1)</A>, <A HREF="ngram-count.1.html">ngram-count(1)</A>.
 | |
| <H2> BUGS </H2>
 | |
| Most scripts depend on the idiosyncrasies of
 | |
| <B> ngram -ppl </B>
 | |
| output.
 | |
| <H2> AUTHOR </H2>
 | |
| Andreas Stolcke &lt;stolcke@icsi.berkeley.edu&gt;
 | |
| <BR>
 | |
| Copyright (c) 1995-2009 SRI International
 | |
| <BR>
 | |
| Copyright (c) 2011-2016 Andreas Stolcke
 | |
| <BR>
 | |
| Copyright (c) 2011-2016 Microsoft Corp.
 | |
| </BODY>
 | |
| </HTML>
 | 
