.\" $Id: ngram-format.5,v 1.5 2007/12/19 22:08:05 stolcke Exp $
.TH ngram-format 5 "$Date: 2007/12/19 22:08:05 $" "SRILM File Formats"
.SH NAME
ngram-format \- File format for ARPA backoff N-gram models
.SH SYNOPSIS
.nf
\fB\\data\\\fP
\fBngram 1=\fP\fIn1\fP
\fBngram 2=\fP\fIn2\fP
\&...
\fBngram\fP \fIN\fP\fB=\fP\fInN\fP

\fB\\1-grams:\fP
\fIp\fP \fIw\fP [\fIbow\fP]
\&...

\fB\\2-grams:\fP
\fIp\fP \fIw1 w2\fP [\fIbow\fP]
\&...

\fB\\\fP\fIN\fP\fB-grams:\fP
\fIp\fP \fIw1\fP ... \fIwN\fP
\&...

\fB\\end\\\fP
.fi
.SH DESCRIPTION
The so-called ARPA (or Doug Paul) format for N-gram backoff models
starts with a header, introduced by the keyword \fB\\data\\\fP,
listing the number of N-grams of each length.
Following that, N-grams are listed one per line, grouped into sections
by length, each section starting with the keyword \fB\\\fP\fIN\fP\fB-grams:\fP,
where
.I N
is the length of the N-grams to follow.
Each N-gram line starts with the logarithm (base 10) of the conditional probability
.I p
of that N-gram, followed by the words
.IR w1 ... wN
making up the N-gram.
These are optionally followed by the logarithm (base 10) of the
backoff weight for the N-gram.
The keyword \fB\\end\\\fP
concludes the model representation.
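.PP
As an illustration of the layout described above, the following Python
sketch reads one model into per-order tables.
It is not part of SRILM: the function name read_arpa and the returned
data layout are assumptions made for this example.
The backslash is spelled chr(92) so that the listing contains no roff
escape characters.
.PP
.nf
```python
def read_arpa(lines):
    """Parse one ARPA model; return (counts, ngrams).

    counts maps order -> count declared in the header.
    ngrams maps order -> {word tuple: (log10 prob, log10 bow or None)}.
    """
    bs = chr(92)  # backslash character
    counts, ngrams = {}, {}
    order = None      # order of the section being read, None in header
    started = False   # True once the data keyword has been seen
    for raw in lines:
        line = raw.strip()
        if not started:
            started = line == bs + "data" + bs
            continue  # skip ancillary text before the model
        if not line:
            continue
        if line == bs + "end" + bs:
            break
        if line.startswith("ngram ") and order is None:
            n, count = line[len("ngram "):].split("=")
            counts[int(n)] = int(count)
        elif line.startswith(bs) and line.endswith("-grams:"):
            order = int(line[1:-len("-grams:")])
            ngrams[order] = {}
        elif order is not None:
            fields = line.split()
            words = tuple(fields[1:1 + order])
            # A trailing extra field, if present, is the backoff weight.
            bow = float(fields[1 + order]) if len(fields) > 1 + order else None
            ngrams[order][words] = (float(fields[0]), bow)
    return counts, ngrams
```
.fi
.PP
The sketch stops at the end keyword, so text following the model is
left unread, and it performs no validation of the header counts.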
.PP
Backoff weights are required only for those N-grams
that form a prefix of longer N-grams in the model.
The highest-order N-grams in particular do not need backoff weights,
since the model contains no longer N-grams for which they could serve
as a context.
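.PP
The role of the backoff weights can be illustrated with the usual
backoff recursion, sketched below in Python.
This is an example under stated assumptions, not SRILM code: the
function name backoff_logprob and the flat table keyed by word tuples
are hypothetical.
.PP
.nf
```python
def backoff_logprob(ngrams, words):
    """Return log10 p(last word | preceding words) by backing off.

    ngrams maps a word tuple to (log10 prob, log10 bow or None);
    this layout is an assumption made for the example.
    """
    if words in ngrams:
        return ngrams[words][0]
    if len(words) == 1:
        return float("-inf")  # unlisted word: probability zero
    context = words[:-1]
    # The weight of the context N-gram is applied only when backing off;
    # a missing weight counts as log10(1) = 0.
    entry = ngrams.get(context)
    bow = entry[1] if entry and entry[1] is not None else 0.0
    return bow + backoff_logprob(ngrams, words[1:])
```
.fi
.PP
The weight is looked up only for the context, which is always shorter
than the N-gram being scored; weights attached to highest-order N-grams
would therefore never be consulted.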
.PP
Since log(0) (minus infinity) has no portable representation, such values
are mapped to a large negative number when a model is written.
Conversely, the designated dummy value (-99 in SRILM) is interpreted as log(0)
when read back from file into memory.
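.PP
A reader or writer for the format might handle the dummy value as in
the following Python sketch.
The names LOG_ZERO_DUMMY, read_logprob and write_logprob are
hypothetical, and writing exactly the dummy value for minus infinity is
an assumption of the example.
.PP
.nf
```python
import math

LOG_ZERO_DUMMY = -99.0  # dummy value SRILM uses for log10(0)

def read_logprob(field):
    """Convert a log10 field from a model file, mapping the dummy to -inf."""
    value = float(field)
    return -math.inf if value == LOG_ZERO_DUMMY else value

def write_logprob(value):
    """Format a log10 value for a model file, mapping -inf to the dummy."""
    return "-99" if value == -math.inf else repr(value)
```
.fi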
.PP
The correctness of the N-gram counts
.IR n1 ,
.IR n2 ,
\&... in the header is not enforced by SRILM software when reading
models (although a warning is printed when an inconsistency is encountered).
This allows easy textual insertion or deletion of parameters in a model file.
The proper format can be recovered by passing the model through
the command
.nf
ngram -order \fIN\fP -lm \fIinput\fP -write-lm \fIoutput\fP
.fi
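.PP
After editing a model by hand, the header can be compared against the
actual section contents before the file is renormalized.
The Python sketch below shows one way to do so; it is not how SRILM
performs its check, and the function name count_mismatches and its
argument layout are assumptions made for the example.
.PP
.nf
```python
def count_mismatches(declared, ngrams):
    """Return {order: (declared, actual)} for inconsistent header counts.

    declared maps order -> count given in the header.
    ngrams maps order -> collection of N-grams read for that order.
    """
    orders = sorted(set(declared) | set(ngrams))
    actual = {n: len(ngrams.get(n, ())) for n in orders}
    return {n: (declared.get(n, 0), actual[n])
            for n in orders if declared.get(n, 0) != actual[n]}
```
.fi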
.PP
Note that the format is self-delimiting, allowing multiple models to
be stored in one file, or to be surrounded by ancillary information.
Some extensions of N-gram models in SRILM store additional parameters
after a basic N-gram section in the standard format.
.SH "SEE ALSO"
ngram(1), ngram-count(1), lm-scripts(1), pfsg-scripts(1).
.SH BUGS
The ARPA format does not allow N-grams that have only a backoff weight
associated with them, but no conditional probability.
This makes the format less general than would otherwise be useful
(e.g., to support pruned models, or ones containing a mix of words and
classes).
The
.BR ngram-count (1)
tool satisfies this constraint by inserting dummy probabilities where
necessary.
.PP
For simplicity, an N-gram model containing N-grams up to length
.I N
is referred to in the SRILM programs as an
.IR N -th
order model, although technically it represents a Markov model of
order
.IR N -1.
.PP
There is no way to specify words with embedded whitespace.
.SH AUTHOR
The ARPA backoff format was developed by Doug Paul at MIT Lincoln Labs
for research sponsored by the U.S. Department of Defense
Advanced Research Projects Agency (ARPA).
.br
Man page by Andreas Stolcke <stolcke@speech.sri.com>.
.br
Copyright 1999, 2004 SRI International