.\" $Id: ppl-scripts.1,v 1.9 2019/09/09 22:35:37 stolcke Exp $
.TH ppl-scripts 1 "$Date: 2019/09/09 22:35:37 $" "SRILM Tools"
.SH NAME
ppl-scripts, add-ppls, compare-ppls, compute-best-mix, compute-best-sentence-mix, filter-event-counts, hits-from-log, ppl-from-log, subtract-ppls \- manipulate perplexities
.SH SYNOPSIS
.nf
\fBadd-ppls\fP [ \fIppl-file\fP ... ]
\fBsubtract-ppls\fP \fIppl-file1\fP [ \fIppl-file2\fP ... ]
\fBppl-from-log\fP [ \fIppl-file\fP ... ]
\fBhits-from-log\fP [ \fIppl-file\fP ... ]
\fBcompare-ppls\fP [ \fBmindelta=\fP\fID\fP ] \fIppl-file1\fP \fIppl-file2\fP
\fBcompute-best-mix\fP [ \fBlambda='\fP\fIl1 l2\fP ...\fB'\fP ] [ \fBprecision=\fP\fIP\fP ] \\
	\fIppl-file1\fP [ \fIppl-file2\fP ... ]
\fBcompute-best-sentence-mix\fP [ \fBlambda='\fP\fIl1 l2\fP ...\fB'\fP ] [ \fBprecision=\fP\fIP\fP ] \\
	[ \fBaddone=\fP\fIc\fP ] \fIppl-file1\fP [ \fIppl-file2\fP ... ]
\fBfilter-event-counts\fP [ \fBorder=\fP\fIN\fP ] [ \fBescape=\fP\fIstring\fP ] [ \fIcounts\fP ... ]
.fi
.SH DESCRIPTION
These scripts process the output of the
.BR ngram (1)
option
.B \-ppl
to extract various useful information.
They are particularly convenient for analyzing the performance (perplexity) of
language models on specific subsets of the test data,
or for comparing and combining multiple models.
.PP
.B add-ppls
takes several ppl output files and computes an aggregate perplexity and
corpus statistics.
Its output is suitable for subsequent manipulation by
.B add-ppls
or
.BR subtract-ppls .
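.PP
As a rough illustration of this kind of aggregation, the following Python
sketch sums per-file statistics and recomputes the perplexity.
It assumes the usual convention that perplexity is
10^(-logprob / (words - OOVs - zeroprobs + sentences)),
with logprob being a base-10 total.
The PplStats class and the function names are hypothetical and are not
part of SRILM; the script itself may differ in details.
.nf
```python
from dataclasses import dataclass


@dataclass
class PplStats:
    """Corpus statistics as summarized in one ppl output (hypothetical container)."""
    sentences: int
    words: int
    oovs: int
    zeroprobs: int
    logprob: float  # total log10 probability of the scored tokens


def add_stats(parts):
    """Sum the statistics of several ppl outputs into one aggregate."""
    total = PplStats(0, 0, 0, 0, 0.0)
    for p in parts:
        total.sentences += p.sentences
        total.words += p.words
        total.oovs += p.oovs
        total.zeroprobs += p.zeroprobs
        total.logprob += p.logprob
    return total


def perplexity(stats, count_sentence_ends=True):
    """Perplexity from aggregate statistics (assumed formula, see lead-in)."""
    denom = stats.words - stats.oovs - stats.zeroprobs
    if count_sentence_ends:
        denom += stats.sentences
    if denom <= 0:
        raise ValueError("no scored tokens")
    return 10.0 ** (-stats.logprob / denom)
```
.fi
Subtraction works the same way with the signs reversed, which is why the
two operations can be chained.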
.PP
.B subtract-ppls
similarly computes an aggregate perplexity by removing the
statistics of zero or more
.I ppl-file2
from those in
.IR ppl-file1 .
Its output is suitable for subsequent manipulation by
.B add-ppls
or
.BR subtract-ppls .
.PP
.B ppl-from-log
recomputes the total perplexities and statistics from individual
lines in
.B "ngram \-debug 2 \-ppl"
output.
Combined with some filtering of that output, this allows computing
perplexities on interesting subsets of words.
.PP
.B hits-from-log
computes N-gram hit rates from
.B "ngram \-debug 2 \-ppl"
output.
.PP
.B compare-ppls
tallies the number of words for which two language models produce the same,
higher, or lower probabilities.
The input files should be
.B "ngram \-debug 2 \-ppl"
output for the two models on the same test set.
The parameter
.I D
is the minimum absolute difference for two log probabilities to be
considered different (the default is 0).
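.PP
The tally can be pictured with the short Python sketch below.
It is only an illustration: the function name, the plain lists of per-word
log probabilities, and the choice to report the second model relative to
the first are all assumptions, as is treating a difference of exactly
.I D
as "same".
.nf
```python
def compare_logprobs(logprobs1, logprobs2, mindelta=0.0):
    """Tally words where model 2 is the same as, higher than, or lower than model 1.

    logprobs1, logprobs2: per-word log probabilities for the same test set.
    mindelta: differences with absolute value <= mindelta count as "same"
              (assumed reading of the mindelta parameter).
    """
    if len(logprobs1) != len(logprobs2):
        raise ValueError("inputs must cover the same words")
    tally = {"same": 0, "higher": 0, "lower": 0}
    for lp1, lp2 in zip(logprobs1, logprobs2):
        diff = lp2 - lp1
        if abs(diff) <= mindelta:
            tally["same"] += 1
        elif diff > 0:
            tally["higher"] += 1
        else:
            tally["lower"] += 1
    return tally
```
.fi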
.PP
.B compute-best-mix
takes the output of several
.B "ngram \-debug 2 \-ppl"
runs on the same test set and computes the optimal interpolation
weights for the corresponding models,
i.e., the weights that minimize the perplexity of an interpolated model.
Initial weights may be specified as
.IR "l1 l2 ..." .
The computation is iterative and stops when the interpolation weights
change by less than
.I P
(default 0.001).
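.PP
One standard way to fit such mixture weights is expectation-maximization
(EM); the Python sketch below shows that technique under stated assumptions
and is not taken from the script itself.
The function name, the input layout (one row of per-model probabilities
per word), the uniform starting weights, and the iteration cap are all
hypothetical.
.nf
```python
def best_mix(probs, lambdas=None, precision=0.001, max_iter=1000):
    """Estimate interpolation weights by EM.

    probs: list of rows, one per test-set word; row[j] is the probability
           that model j assigned to that word.
    lambdas: optional initial weights (assumed uniform if omitted).
    precision: stop once no weight changes by this much or more.
    """
    n_models = len(probs[0])
    if lambdas is None:
        lambdas = [1.0 / n_models] * n_models
    for _ in range(max_iter):
        # E-step: accumulate each model's posterior share of every word.
        totals = [0.0] * n_models
        for row in probs:
            mix = sum(l * p for l, p in zip(lambdas, row))
            if mix == 0.0:
                continue  # word has zero probability under the mixture
            for j in range(n_models):
                totals[j] += lambdas[j] * row[j] / mix
        norm = sum(totals)
        if norm == 0.0:
            break  # nothing to learn from; keep the current weights
        # M-step: new weights are the normalized posterior totals.
        new_lambdas = [t / norm for t in totals]
        delta = max(abs(a - b) for a, b in zip(new_lambdas, lambdas))
        lambdas = new_lambdas
        if delta < precision:
            break
    return lambdas
```
.fi
Each EM step cannot increase the perplexity of the interpolated model on
the given data, so the weights settle toward a minimizing mixture.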
.PP
.B compute-best-sentence-mix
similarly optimizes the weights for sentence-level interpolation of LMs.
It requires input files generated by
.BR "ngram \-debug 1 \-ppl" .
(Sentence-level mixtures can be implemented using the
.B "ngram \-hmm"
option, by constructing a suitable HMM structure.)
The
.BI addone= c
option performs Laplace smoothing by adding
.I c
to the estimated posterior counts for each model.
.PP
.B filter-event-counts
prepares a count file for perplexity computation.
It removes counts that do not represent events to the LM.
The
.BI order= N
option specifies the maximal N-gram order to use.
The effect of filtering is such that
.nf
	ngram -order \fIN\fP -lm \fILM\fP -ppl \fITEXT\fP
.fi
and
.nf
	ngram-count -order \fIN\fP -text \fITEXT\fP -write - | \\
	filter-event-counts order=\fIN\fP | \\
	ngram -order \fIN\fP -lm \fILM\fP -counts -
.fi
yield the same result.
The
.B escape=
option specifies a string that causes all input lines beginning with
that string to be passed through
(useful in combination with
.BR "ngram \-escape" ).
.SH "SEE ALSO"
ngram(1), ngram-count(1).
.SH BUGS
Most scripts depend on the idiosyncrasies of
.B "ngram \-ppl"
output.
.SH AUTHOR
Andreas Stolcke <stolcke@icsi.berkeley.edu>
.br
Copyright (c) 1995\-2009 SRI International
.br
Copyright (c) 2011\-2016 Andreas Stolcke
.br
Copyright (c) 2011\-2016 Microsoft Corp.