competition update

nckcard
2025-07-02 12:18:09 -07:00
parent 9e17716a4a
commit 77dbcf868f
2615 changed files with 1648116 additions and 125 deletions

@@ -0,0 +1,37 @@
classes-format(5) classes-format(5)
NAME
classes-format - File format for word class definitions
SYNOPSIS
class [p] word1 word2 ...
DESCRIPTION
Various programs dealing with word classes use this format to define
the possible expansions of classes and their respective probabilities.
Each expansion appears on a separate line as in the synopsis, where
class names a word class, p gives the probability for the class
expansion, and word1 word2 ... defines the word string that the class
expands to. If p is omitted it is assumed to be 1. (All expansion
probabilities for a given class should sum to one; the software does
not necessarily enforce this, and a violation leads to improper
models.)
Note that the concept of word class here is generalized to include
``multi-words'', or phrases consisting of more than one word. All
expansions must have at least one word. Certain models might impose
more restrictive formats.
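The following Python sketch illustrates how a class definition file of this form could be read and checked for normalization. It is not part of SRILM; the function names are hypothetical, and the rule used to tell an omitted p from a numeric first word (a leading number counts as p only if at least one word follows it) is an assumption made for the example.

```python
from collections import defaultdict


def parse_class_expansions(lines):
    """Return {class_name: [(probability, [word, ...]), ...]}."""
    expansions = defaultdict(list)
    for raw in lines:
        fields = raw.split()
        if not fields:
            continue  # skip blank lines
        cls, rest = fields[0], fields[1:]
        prob = 1.0  # p defaults to 1 when omitted
        # Assumption: a leading number is p only if a word still follows it.
        if len(rest) >= 2:
            try:
                prob = float(rest[0])
                rest = rest[1:]
            except ValueError:
                pass  # first field is a word, not a probability
        if not rest:
            raise ValueError("expansion needs at least one word: %r" % raw)
        expansions[cls].append((prob, rest))
    return dict(expansions)


def unnormalized_classes(expansions, tol=1e-6):
    """Return {class_name: total} for classes whose probabilities do not sum to one."""
    totals = {c: sum(p for p, _ in exps) for c, exps in expansions.items()}
    return {c: t for c, t in totals.items() if abs(t - 1.0) > tol}
```

A multi-word expansion such as "CITY new york" is returned as a single expansion with two words and probability 1. A line whose first word is itself numeric remains ambiguous under this heuristic.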
SEE ALSO
ngram(1), ngram-class(1), disambig(1), training-scripts(1),
pfsg-scripts(1).
AUTHOR
Andreas Stolcke <stolcke@speech.sri.com>.
Copyright 1999 SRI International
SRILM File Formats $Date: 2007/12/19 22:08:05 $ classes-format(5)

@@ -0,0 +1,64 @@
nbest-format(5) nbest-format(5)
NAME
nbest-format - File formats for N-best hypotheses lists
DESCRIPTION
SRILM currently understands three different formats for lists of N-best
hypotheses for rescoring or 1-best hypothesis extraction. The first
two formats originated in the SRI Decipher(TM) recognition system; the
third format is particular to SRILM.
The first format consists of the header
NBestList1.0
followed by one or more lines of the form
(score) w1 w2 w3 ...
where score is a composite acoustic/language model score from the
recognizer, on the bytelog scale. (A bytelog is a logarithm to base
1.0001, divided by 1024 and rounded to an integer.) This format is
output by the SRI Decipher(TM) recognizer, by ngram -nbest, and by
nbest-lattice -write-nbest -decipher-nbest.
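As a worked illustration of the bytelog definition given above, the following Python sketch converts between probabilities, bytelogs, and base-10 log scores. The function and constant names are hypothetical and only restate the definition; this is not SRILM's own code.

```python
import math

BYTELOG_BASE = 1.0001   # logarithm base, from the definition above
BYTELOG_DIVISOR = 1024  # divisor applied before rounding


def prob_to_bytelog(p):
    """Logarithm to base 1.0001, divided by 1024, rounded to an integer."""
    return round(math.log(p, BYTELOG_BASE) / BYTELOG_DIVISOR)


def bytelog_to_log10(b):
    """Undo the scaling and express the score as a base-10 logarithm."""
    return b * BYTELOG_DIVISOR * math.log10(BYTELOG_BASE)
```

Because of the rounding step, the conversion loses precision: one bytelog unit corresponds to roughly 0.044 on the log10 scale.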
The second Decipher(TM) format is an extension of the first format that
encodes word-level scores and time alignments. It is marked by a
header of the form
NBestList2.0
The hypotheses are in the format
(score) w1 ( st: st1 et: et1 g: g1 a: a1 ) w2 ...
where words are followed by start and end times, language model and
acoustic scores (bytelog-scaled), respectively. This format may also
contain scores and time marks for sub-word units (phones and HMM
states), in the same format as above, but with the w's denoting phone
and state names. Sub-word units will have time marks that are con-
tained in the duration of the preceding word units, and may thus be
easily identified.
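A hypothesis line in the NBestList2.0 layout shown above could be split into its units roughly as follows. This is a sketch in Python with hypothetical names; it assumes whitespace-separated fields exactly as in the pattern above and does not try to separate words from sub-word units.

```python
import re

# "(score)" followed by the rest of the line
HYP_SCORE = re.compile(r"^\((\S+)\)\s*(.*)$")
# "label ( st: .. et: .. g: .. a: .. )"
UNIT = re.compile(
    r"(\S+)\s+\(\s*st:\s*(\S+)\s+et:\s*(\S+)\s+g:\s*(\S+)\s+a:\s*(\S+)\s*\)"
)


def parse_nbest2_hyp(line):
    """Return (score, [unit, ...]) for one NBestList2.0 hypothesis line."""
    m = HYP_SCORE.match(line.strip())
    if m is None:
        raise ValueError("not an NBestList hypothesis line: %r" % line)
    score = float(m.group(1))
    units = [
        {"label": w, "start": float(st), "end": float(et),
         "lm": float(g), "acoustic": float(a)}
        for w, st, et, g, a in UNIT.findall(m.group(2))
    ]
    return score, units
```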
The third format understood by SRILM lists hypotheses in the format
ascore lscore nwords w1 w2 w3 ...
where the first three columns contain the acoustic model log
probability, the language model log probability, and the number of
words in the hypothesis string, respectively. All scores are
logarithms base 10. (This format must not be preceded by an
``NBestList'' header.) This format is output by ngram -rescore and by
nbest-lattice -write-nbest without the -decipher-nbest option.
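The third format is simple enough to read with a few lines of code. The Python sketch below parses such lines and picks a 1-best hypothesis; the function names and the lm_weight parameter are hypothetical, and the scoring rule (acoustic score plus weighted language model score) is only an illustrative assumption, not necessarily the combination SRILM tools apply.

```python
def parse_srilm_nbest_line(line):
    """Return (ascore, lscore, nwords, words) for one hypothesis line."""
    fields = line.split()
    ascore = float(fields[0])  # acoustic model log10 probability
    lscore = float(fields[1])  # language model log10 probability
    nwords = int(fields[2])    # number of words, as stated in the file
    return ascore, lscore, nwords, fields[3:]


def best_hypothesis(lines, lm_weight=1.0):
    """Pick the hypothesis maximizing ascore + lm_weight * lscore (assumed rule)."""
    hyps = [parse_srilm_nbest_line(l) for l in lines if l.strip()]
    return max(hyps, key=lambda h: h[0] + lm_weight * h[1])
```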
SEE ALSO
ngram(1), nbest-lattice(1), segment-nbest(1), nbest-scripts(1), pfsg-
scripts(1).
BUGS
All these formats are somewhat ad hoc and could use a more rational
design. The ``NBestList1.0'' format is particularly cumbersome because
it conflates acoustic and language model scores.
A generalization to an arbitrary number of separate scores would be
nice.
AUTHOR
Manual page written by Andreas Stolcke <stolcke@speech.sri.com>.
Copyright 1999-2001 SRI International
SRILM File Formats $Date: 2007/12/19 22:08:05 $ nbest-format(5)

@@ -0,0 +1,89 @@
ngram-format(5) ngram-format(5)
NAME
ngram-format - File format for ARPA backoff N-gram models
SYNOPSIS
\data\
ngram 1=n1
ngram 2=n2
...
ngram N=nN
\1-grams:
p w [bow]
...
\2-grams:
p w1 w2 [bow]
...
\N-grams:
p w1 ... wN
...
\end\
DESCRIPTION
The so-called ARPA (or Doug Paul) format for N-gram backoff models
starts with a header, introduced by the keyword \data\, listing the
number of N-grams of each length. Following that, N-grams are listed
one per line, grouped into sections by length, each section starting
with the keyword \N-grams:, where N is the length of the N-grams to
follow. Each N-gram line starts with the logarithm (base 10) of the
conditional probability p of that N-gram, followed by the words
w1...wN making up the N-gram. These are optionally followed by the
logarithm (base 10) of the backoff weight for the N-gram. The keyword
\end\ concludes the model representation.
Backoff weights are required only for those N-grams that form a prefix
of longer N-grams in the model. The highest-order N-grams in particu-
lar will not need backoff weights (they would be useless).
Since log(0) (minus infinity) has no portable representation, such val-
ues are mapped to a large negative number. However, the designated
dummy value (-99 in SRILM) is interpreted as log(0) when read back from
file into memory.
The correctness of the N-gram counts n1, n2, ... in the header is not
enforced by SRILM software when reading models (although a warning is
printed when an inconsistency is encountered). This allows easy
textual insertion or deletion of parameters in a model file. The proper
format can be recovered by passing the model through the command
ngram -order N -lm input -write-lm output
Note that the format is self-delimiting, allowing multiple models to be
stored in one file, or to be surrounded by ancillary information. Some
extensions of N-gram models in SRILM store additional parameters after
a basic N-gram section in the standard format.
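The properties described above (self-delimiting sections, optional backoff weights, the -99 stand-in for log(0), and unenforced header counts) can be illustrated with a small reader. This Python sketch is not SRILM code; read_arpa and count_mismatches are hypothetical names, and the reader handles only the first model in its input.

```python
import math

LOG_ZERO_SENTINEL = -99.0  # SRILM's dummy value for log10(0), per the text above


def read_arpa(lines):
    """Return (declared_counts, ngrams); ngrams[n][words] = (log10 p, log10 bow or None)."""
    declared, ngrams = {}, {}
    order = None
    in_model = False
    for raw in lines:
        line = raw.strip()
        if not in_model:
            # Ancillary text before \data\ is skipped (format is self-delimiting).
            in_model = (line == "\\data\\")
            continue
        if line == "\\end\\":
            break
        if not line:
            continue
        if order is None and line.startswith("ngram "):
            n, count = line[len("ngram "):].split("=")
            declared[int(n)] = int(count)
        elif line.startswith("\\") and line.endswith("-grams:"):
            order = int(line[1:-len("-grams:")])
            ngrams[order] = {}
        elif order is not None:
            fields = line.split()
            logp = float(fields[0])
            if logp == LOG_ZERO_SENTINEL:
                logp = -math.inf  # interpret the dummy value as log(0)
            words = tuple(fields[1:1 + order])
            # The backoff weight is optional and follows the words.
            bow = float(fields[1 + order]) if len(fields) > 1 + order else None
            ngrams[order][words] = (logp, bow)
    return declared, ngrams


def count_mismatches(declared, ngrams):
    """Return {n: (declared, actual)} where the header disagrees with the data."""
    return {n: (c, len(ngrams.get(n, {})))
            for n, c in declared.items() if c != len(ngrams.get(n, {}))}
```

A non-empty result from count_mismatches corresponds to the situation in which SRILM prints a warning but still reads the model.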
SEE ALSO
ngram(1), ngram-count(1), lm-scripts(1), pfsg-scripts(1).
BUGS
The ARPA format does not allow N-grams that have only a backoff weight
associated with them, but no conditional probability. This makes the
format less general than would otherwise be useful (e.g., to support
pruned models, or ones containing a mix of words and classes). The
ngram-count(1) tool satisfies this constraint by inserting dummy
probabilities where necessary.
For simplicity, an N-gram model containing N-grams up to length N is
referred to in the SRILM programs as an N-th order model, although
technically it represents a Markov model of order N-1.
There is no way to specify words with embedded whitespace.
AUTHOR
The ARPA backoff format was developed by Doug Paul at MIT Lincoln Labs
for research sponsored by the U.S. Department of Defense Advanced
Research Projects Agency (ARPA).
Man page by Andreas Stolcke <stolcke@speech.sri.com>.
Copyright 1999, 2004 SRI International
SRILM File Formats $Date: 2007/12/19 22:08:05 $ ngram-format(5)

@@ -0,0 +1,65 @@
pfsg-format(5) pfsg-format(5)
NAME
pfsg-format - File format for Decipher(TM) probabilistic finite-state
grammars
SYNOPSIS
name name
nodes N w1 ... wN
initial i
final f
transitions T
n1 n2 p
...
DESCRIPTION
Probabilistic finite-state grammars (PFSGs) are a form of finite-state
automaton or transducer used by the SRI Decipher(TM) recognizer. PFSGs
emit words (outputs) at the nodes, not on the arcs. Certain types of
language models manipulated by SRILM can be translated into PFSGs for
direct use in the recognizer.
Since it is usually fairly easy to convert between different finite-
state network representations, PFSGs can serve as an intermediate for-
mat for the generation of other finite-state formats. For example,
PFSGs can be converted to the AT&T fsm(5) format.
Each PFSG is given a name. The name is significant if PFSGs are to be
composed, in which case the name specifies the category it expands.
The nodes line gives the number of nodes in the state graph, followed
by the word strings associated with each node. If the node represents
a category expanded by another PFSG, then the name string of that PFSG
is given here. The token NULL is special and designates the
corresponding node as non-emitting. It is conventional to use lowercase
strings for words, and uppercase for categories and PFSG names
(``NULL'' must be avoided, of course).
The initial and final lines specify the start and end states of the
grammar, respectively. Nodes are numbered starting at zero.
The transitions line gives the number of arcs (transitions) between
states. It is followed by that many lines, each specifying one
transition by its originating state n1, its target state n2, and the
transition cost p. The transition cost is usually interpreted as
10000.5 times the natural logarithm of a probability, and should be
normalized and scaled accordingly.
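To make the layout and the cost scaling concrete, here is a Python sketch of a reader for a single PFSG together with the cost/probability conversion. The names parse_pfsg, cost_to_prob, and prob_to_cost are hypothetical, and the reader assumes purely whitespace-separated tokens in the order given in the synopsis.

```python
import math

PFSG_SCALE = 10000.5  # cost = PFSG_SCALE * ln(probability), per the text above


def cost_to_prob(cost):
    return math.exp(cost / PFSG_SCALE)


def prob_to_cost(p):
    return PFSG_SCALE * math.log(p)


def parse_pfsg(text):
    """Parse one PFSG from whitespace-separated tokens."""
    tokens = text.split()
    pos = 0

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expect(keyword):
        if take() != keyword:
            raise ValueError("expected keyword %r" % keyword)

    expect("name")
    name = take()
    expect("nodes")
    labels = [take() for _ in range(int(take()))]  # one label per node, from node 0
    expect("initial")
    initial = int(take())
    expect("final")
    final = int(take())
    expect("transitions")
    arcs = [(int(take()), int(take()), float(take()))
            for _ in range(int(take()))]  # (from node, to node, cost)
    return {"name": name, "labels": labels, "initial": initial,
            "final": final, "arcs": arcs}
```

With this scaling, a cost of 0 corresponds to probability 1 and a cost near -6932 to probability 0.5.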
SEE ALSO
pfsg-scripts(1), fsm(1).
BUGS
File formats are a matter of taste ...
There is no way to specify words with embedded whitespace.
AUTHOR
PFSGs were developed as part of SRI's Decipher(TM) recognition system.
Manual page written by Andreas Stolcke <stolcke@speech.sri.com>.
Copyright 1999, 2004 SRI International
SRILM File Formats $Date: 2007/12/19 22:08:05 $ pfsg-format(5)

@@ -0,0 +1,116 @@
wlat-format(5) File Formats Manual wlat-format(5)
NAME
wlat-format - File format for SRILM word posterior lattices
SYNOPSIS
Word lattices:
version 2
name s
initial i
final f
node n w a p n1 p1 n2 p2 ...
...
Word meshes (confusion networks):
name s
numaligns N
posterior P
align a w1 p1 w2 p2 ...
reference a w
hyps a w h1 h2 ...
info a w start dur ascore gscore phones phonedurs
time a t
...
DESCRIPTION
Word posterior lattices and meshes are lattices generated by aligning
N-best hypotheses with nbest-lattice(1), or by aligning PFSG or HTK
lattices with lattice-tool(1). They compactly encode possible word
hypothesis sequences and their posterior probabilities. (Word meshes
have become generally known as ``confusion networks'' or ``sausages.'')
A word lattice is a partially ordered directed graph with nodes
representing word hypotheses. Nodes are identified by non-negative
integers. The file format specifies the initial node i, the final node
f, and any number of additional nodes n. For each node n the following
associated information is given on the same line: the word identity w
(the string ``NULL'' is used with initial and final nodes), the
alignment position a (identical values in this field identify
hypotheses that occur at the same position), and the word posterior
probability p. Following these values, zero or more transitions to
successor nodes are specified, each given by the node index ni and the
transition posterior probability pi. In a properly normalized word
lattice the transition posteriors pi sum up to the node posterior p.
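The node line layout and the normalization property can be sketched in Python as follows. The function names are hypothetical. The alignment position is kept as an uninterpreted string because its type is not stated here, and the check is only meaningful for nodes that have outgoing transitions.

```python
def parse_node_line(line):
    """Parse 'node n w a p n1 p1 n2 p2 ...' into a dictionary."""
    fields = line.split()
    if len(fields) < 5 or fields[0] != "node":
        raise ValueError("not a node line: %r" % line)
    rest = fields[5:]
    if len(rest) % 2 != 0:
        raise ValueError("transitions must come in (node, posterior) pairs")
    return {
        "node": int(fields[1]),
        "word": fields[2],
        "align": fields[3],            # kept as a string (assumption)
        "posterior": float(fields[4]),
        "transitions": [(int(rest[i]), float(rest[i + 1]))
                        for i in range(0, len(rest), 2)],
    }


def transitions_normalized(node, tol=1e-6):
    """True if the transition posteriors sum to the node posterior."""
    total = sum(p for _, p in node["transitions"])
    return abs(total - node["posterior"]) <= tol
```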
Word meshes represent a more constrained lattice format in which word
hypotheses are in a total order. A mesh contains a number of alignment
positions, and a set of mutually exclusive word hypotheses in each
position (the ``confusion sets''). The word mesh represents all
sentence hypotheses that can be generated by freely combining word
hypotheses at each position. The file format specifies the number of
alignment positions N and the total posterior probability mass P
contained in the lattice, followed by one or more confusion set
specifications. For each alignment position a, the hypothesized words
wi and their posterior probabilities pi are listed in alternation. The
pseudo-word string *DELETE* represents an empty hypothesis.
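Since each position holds mutually exclusive hypotheses, one natural use of a word mesh is to pick the most probable word at every position. The Python sketch below does this for the align lines of a mesh; the function names are hypothetical, and it assumes that alignment positions are integers whose numeric order is the word order.

```python
DELETE = "*DELETE*"  # pseudo-word for the empty hypothesis


def parse_align_line(line):
    """Parse 'align a w1 p1 w2 p2 ...' into (position, {word: posterior})."""
    fields = line.split()
    if len(fields) < 2 or fields[0] != "align":
        raise ValueError("not an align line: %r" % line)
    rest = fields[2:]
    confusion_set = {rest[i]: float(rest[i + 1]) for i in range(0, len(rest), 2)}
    return int(fields[1]), confusion_set


def best_path(lines):
    """Return the highest-posterior word at each position, dropping *DELETE*."""
    sets = dict(parse_align_line(l) for l in lines if l.startswith("align "))
    words = []
    for position in sorted(sets):
        best = max(sets[position], key=sets[position].get)
        if best != DELETE:
            words.append(best)
    return words
```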
Optionally, the word mesh format encodes additional information about
the hypothesis alignment from which it resulted. The keyword reference
specifies the correct word w that was aligned at position a. The
keyword hyps is used to list the sentence hypotheses of which a certain
word hypothesis was a part. The word hypothesis is identified by an
alignment position a and the word string w, and is followed by the
integer IDs hi (typically, the N-best ranks) of the associated sentence
hypotheses.
As another optional element, the word mesh can contain word-level
acoustic and temporal information, following the keyword info, the
alignment position a, and the word identity w. This information is
derived by nbest-lattice(1) from word- and phone-level backtraces of
N-best hypotheses (as represented in Decipher NBestList2.0 format). The
details of this information are defined in the SRILM class
NBestWordInfo and subject to change, but currently include the
following.
start: word start time (in seconds from the beginning of the waveform);
dur: word duration (in seconds);
ascore: acoustic model likelihood (log base 10);
gscore: grammar (LM and pronunciation) score (log base 10);
phones: sequence of phones in word (separated by colons);
phonedurs: sequence of phone durations (in numbers of frames, separated
by colons).
When word meshes are derived from HTK format lattices, the
pronunciation field will consist of the HTK phone alignment information,
which encodes both phone sequence and durations; the phone duration
field in turn is used to encode the duration model scores, if present.
Note: The encoded information pertains to the word hypothesis with the
highest posterior probability among all hypotheses of the same word
aligned to a given word mesh position.
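For meshes derived from N-best backtraces, an info line with the fields listed above could be unpacked as in this Python sketch. WordInfo and parse_info_line are hypothetical names (unrelated to the SRILM class of similar name), the alignment position is assumed to be an integer, and the HTK-derived variant of the last two fields is not handled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WordInfo:
    position: int         # alignment position a
    word: str             # word identity w
    start: float          # seconds from the beginning of the waveform
    dur: float            # seconds
    ascore: float         # acoustic model likelihood, log base 10
    gscore: float         # grammar score, log base 10
    phones: List[str]     # colon-separated in the file
    phonedurs: List[int]  # frames, colon-separated in the file


def parse_info_line(line):
    """Parse 'info a w start dur ascore gscore phones phonedurs'."""
    f = line.split()
    if len(f) != 9 or f[0] != "info":
        raise ValueError("not an info line: %r" % line)
    return WordInfo(int(f[1]), f[2], float(f[3]), float(f[4]),
                    float(f[5]), float(f[6]),
                    f[7].split(":"), [int(d) for d in f[8].split(":")])
```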
The time keyword is used for debugging purposes and encodes the
estimated timestamp t of an alignment position a when the input
contains backtrace information. It is ignored when reading in word
meshes.
Both formats optionally encode the associated utterance IDs in the name
field. Word lattices and meshes can be converted to PFSG format using
the script wlat-to-pfsg.
SEE ALSO
nbest-lattice(1), lattice-tool(1), pfsg-scripts(1), pfsg-format(5),
nbest-format(5).
L. Mangu, E. Brill, & A. Stolcke, ``Finding consensus in speech
recognition: word error minimization and other applications of
confusion networks,'' Computer Speech and Language 14(4), 373-400, 2000.
BUGS
Detailed alignment and acoustic information is so far only implemented
for word meshes, although conceptually it would apply equally to word
lattices.
AUTHOR
Andreas Stolcke <andreas.stolcke@microsoft.com>
Copyright 2001-2011 SRI International
Copyright 2011-2019 Microsoft Corp.
SRILM File Formats $Date: 2019/02/06 09:53:12 $ wlat-format(5)