competition update

2025-07-02 12:18:09 -07:00
parent 9e17716a4a
commit 77dbcf868f
2615 changed files with 1648116 additions and 125 deletions
--- a/language_model/srilm-1.7.3/man/cat5/nbest-format.5
+++ b/language_model/srilm-1.7.3/man/cat5/nbest-format.5
@@ -0,0 +1,64 @@
+nbest-format(5)                                                nbest-format(5)
+
+
+
+NAME
+       nbest-format - File formats for N-best hypotheses lists
+
+DESCRIPTION
+       SRILM currently understands three different formats for lists of N-best
+       hypotheses for rescoring or 1-best hypothesis extraction.  The first
+       two formats originated in the SRI Decipher(TM) recognition system; the
+       third format is particular to SRILM.
+
+       The first format consists of the header
+            NBestList1.0
+       followed by one or more lines of the form
+            (score) w1 w2 w3 ...
+       where score is a composite acoustic/language model score from the
+       recognizer, on the bytelog scale.  (A bytelog is a logarithm to base
+       1.0001, divided by 1024 and rounded to an integer.)  This format is
+       output by the SRI Decipher(TM) recognizer, by ngram -nbest, and by
+       nbest-lattice -write-nbest -decipher-nbest.
+
+       The second Decipher(TM) format is an extension of the first format that
+       encodes  word-level  scores  and  time  alignments.   It is marked by a
+       header of the form
+            NBestList2.0
+       The hypotheses are in the format
+            (score) w1 ( st: st1 et: et1 g: g1 a: a1 ) w2 ...
+       where each word is followed by its start time, end time, language
+       model score, and acoustic score, in that order (both scores are
+       bytelog-scaled).  This format may also contain scores and time marks
+       for sub-word units (phones and HMM states), in the same format as
+       above, but with the w's denoting phone and state names.  Sub-word
+       units have time marks that fall within the duration of the preceding
+       word unit, and can thus be easily identified.
+
+       The third format understood by SRILM lists hypotheses in the format
+            ascore lscore nwords w1 w2 w3 ...
+       where the first three columns contain the acoustic model log  probabil-
+       ity, the language model log probability, and the number of words in the
+       hypothesis string, respectively.  All scores are  logarithms  base  10.
+       (This format must not be preceded by an ``NBestList'' header.)  This
+       format is output by ngram -rescore and by nbest-lattice -write-nbest
+       without the -decipher-nbest option.
+
+SEE ALSO
+       ngram(1),  nbest-lattice(1),  segment-nbest(1), nbest-scripts(1), pfsg-
+       scripts(1).
+
+BUGS
+       All these formats are somewhat ad hoc and could  use  a  more  rational
+       design.  The ``NBestList1.0'' format is particularly cumbersome because
+       it conflates acoustic and language model scores.
+       A generalization to an arbitrary number of  separate  scores  would  be
+       nice.
+
+AUTHOR
+       Manual page written by Andreas Stolcke <stolcke@speech.sri.com>.
+       Copyright 1999-2001 SRI International
+
+
+
+SRILM File Formats       $Date: 2007/12/19 22:08:05 $          nbest-format(5)
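
Reviewer note (not part of the patch): the bytelog definition and the third
N-best format described in the man page above can be illustrated with a short
Python sketch. It is only an illustration under stated assumptions, not SRILM
code: the function names and the sample line are hypothetical, and the
conversion simply follows the man page's wording (logarithm to base 1.0001,
divided by 1024, rounded to an integer).

```python
import math

LN_BASE = math.log(1.0001)  # natural log of the bytelog base given in the man page


def prob_to_bytelog(p):
    """Hypothetical helper: probability -> bytelog, per the man page wording."""
    return round(math.log(p) / LN_BASE / 1024)


def bytelog_to_log10(b):
    """Hypothetical helper: bytelog -> base-10 log, the scale used by format 3."""
    return b * 1024 * math.log10(1.0001)


def parse_srilm_nbest_line(line):
    """Split one line of the third format: ascore lscore nwords w1 w2 ...

    Returns (ascore, lscore, nwords, words). The nwords column is returned
    as read and is not checked against len(words).
    """
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("expected at least: ascore lscore nwords")
    ascore = float(fields[0])   # acoustic model log10 probability
    lscore = float(fields[1])   # language model log10 probability
    nwords = int(fields[2])     # number of words in the hypothesis
    return ascore, lscore, nwords, fields[3:]


if __name__ == "__main__":
    # Made-up sample line, for illustration only.
    print(parse_srilm_nbest_line("-1234.5 -56.7 3 hello big world"))
    print(prob_to_bytelog(0.1), bytelog_to_log10(prob_to_bytelog(0.1)))
```

Because bytelogs are rounded to integers, converting a probability to a
bytelog and back to log10 is lossy; the sketch makes no attempt to hide that.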