| 
 | |
| Version History
 | |
| 
 | |
| 0.90	29 Jun 95	first working code, n-gram models only
 | |
| 
 | |
| 0.91	02 Aug 95	snapshot for fosler@icsi, minor bug fixes
 | |
| 
 | |
| 0.92	13 Aug 95	added BayesMix, VarNgram LMs
 | |
| 
 | |
| 0.93	27 Aug 95	included all LM95 code
 | |
| 
 | |
| 0.94	13 Oct 95
 | |
| 	* new directory structure mirroring DECIPHER layout.
 | |
| 	* man pages added
 | |
| 	* added support for Decipher N-best list rescoring
 | |
| 	* added Null LM
 | |
| 	* added new utility scripts
 | |
| 	* bug fixes
 | |
| 
 | |
| 0.95	08 Sep 96	as of WS96
 | |
| 	* added Trellis class, disambig program
 | |
| 	* added support for pause tokens (-pau-) in sentences
 | |
| 	  (these are ignored for sentence prob computation)
 | |
| 	* added -tolower mapping
 | |
| 	* added word reversal
 | |
| 	* made Ngram model reading much faster (optimized floating point parsing)
 | |
| 	* added template class for ngram count tries (to use either integer or
 | |
| 	  float count value)
 | |
| 	* added optional noise tag skipping
 | |
| 	* added SkipNgram model
 | |
| 	* added Witten-Bell backoff
 | |
| 	* ported to native Sun and SGI C++ compilers (see doc/c++porting-notes)
 | |
| 	* suppress log10(0.0) warnings
 | |
| 
 | |
| 0.96	05 Jun 97
 | |
| 	* Honor -gtNmin parameter even when discounting of higher counts
 | |
| 	  is effectively disabled.  (Allows building maximum likelihood LMs
 | |
| 	  smoothed only by low-count ngram elimination.)
 | |
| 	* Ignore pauses and noise in nbest-lattice alignments (also added 
 | |
| 	  -noise option).
 | |
| 	* ngram now supports mixtures of up to 6 ngram models.
 | |
| 	* added HiddenSNgram LM.
 | |
| 	* warn about multiple uses of '-' file for input or output
 | |
| 	* zio now handles incomplete reading of compressed file without error
 | |
| 	* Fixed interaction between deletion and iterations
 | |
| 	* Fixed handling of OOVs in cache model
 | |
| 	* Fixed decipher N-best rescoring: we now duplicate even the 
 | |
| 	  roundoff errors incurred by bytelogs.  Also added -decipher flag
 | |
| 	  to ngram to allow replication of recognizer LM scores.
 | |
| 	  Also, takes into account that Decipher (incorrectly) applies WTW
 | |
| 	  even to pauses.
 | |
| 	* Enhanced decipher-rescore script to deal with NBestList2.0 format,
 | |
| 	  with -bytelog and -nodecipherlm options .
 | |
| 	* Added tools to convert bigram and trigram backoff LMs into 
 | |
| 	  Decipher PFSG format (pfsg-from-ngram).
 | |
| 	* Enable DecipherNgram models of order higher than bigram
 | |
| 	  (ngram -decipher-order flag).  Default is still bigram.
 | |
| 	* Fixed bug that caused float command line arguments to be parsed
 | |
| 	  incorrectly on SunOS4 systems (missing declaration in system header).
 | |
| 
 | |
| 0.97 	30 Aug 97	as of WS97
 | |
| 	* New programs: segment and segment-nbest (moved here from
 | |
| 	  development code).
 | |
| 	* Made low-level NgramLM access functions public
 | |
| 	  (findProb, findBOW, insertProb, insertBOW).
 | |
| 	* Fixed nbest-lattice to use normalized posterior word
 | |
| 	  probabilities in lattice.
 | |
| 	* NBest, nbest-lattice: added N-best error computation.
 | |
| 	* WordLattice, nbest-lattice: added lattice error computation.
 | |
| 	* WordLattice: base all alignments on edit distance costs defined
 | |
| 	  in WordAlign.h.
 | |
| 	* contextID() now also returns length of context used.
 | |
| 	  Added contextID() implementations for NullLM and BayesMix.
 | |
| 	* Fixed contextID() for Ngram: don't truncate context if BOW = 1.
 | |
| 	* Fixed SArray, LHash to avoid assignment operator on remove().
 | |
| 	* Fixed add-ppls, subtract-ppls to handle -ppl -debug 2 output.
 | |
| 	* Lots of memory management fixes.
 | |
| 	* SArrayIter and LHashIter now work even while underlying object is
 | |
| 	  being moved (as when containing data structure is enlarged).
 | |
| 	* Added HTK Lattice tool interface (htk/ directory).
 | |
| 	* Made Trellis into a template class.
 | |
| 	* Allow arbitrary n-gram orders with disambig(1).
 | |
| 	* Added forward-backward decoding and posterior probability computation
 | |
| 	  to disambig(1).
 | |
| 	* Added disambig -lmw and -mapw options.
 | |
| 	* Added HMMofNGrams model (ngram -hmm option).
 | |
| 	* VocabMap reader now warns about duplicate entries
 | |
| 
 | |
| 0.98	18 April 98
 | |
| 	* Allow ngram to disable Decipher LM backoff hack, for rescoring
 | |
| 	  new exact lattices (ngram -decipher-nobackoff).
 | |
| 	* N-best list vocabulary is now always expanded dynamically
 | |
| 	  (no more OOVs in N-best lists).
 | |
| 	* Added wrapper script for nbest-lattice to compute N-best error rate
 | |
| 	  (nbest-error).
 | |
| 	* Skip ngrams exceeding model order when reading.
 | |
| 	* Fixed memory bug in generateSentence().
 | |
| 	* Changed libmisc to work with Tcl version > 7.
 | |
| 	* Compute word error correctly for empty N-best list.
 | |
| 	* Added ngram pruning based on model perplexity change
 | |
| 	  (ngram-count -prune and ngram -prune).
 | |
| 	* Old ngram -prune option renamed -varprune.
 | |
| 	* New lattice word error minimization (nbest-lattice -lattice-wer).
 | |
| 	* Fixed ngram -gen bug due to omissions in SunOS4 header files.
 | |
| 	* merge-batch-counts removes merged source files
 | |
| 	* Added ngram -prune-lowprobs function to do the work of
 | |
| 	  remove-lowprob-ngrams, but much faster and using less memory.
 | |
| 	* Added support for new Decipher NBestList2.0 format.
 | |
| 	* Added word error count and posterior probability fields to NBestHyp
 | |
| 	  structure.
 | |
| 	* Added optional factor argument to countSentence() (convenient
 | |
| 	  to compute fractional sufficient statistics for alternative
 | |
| 	  training methods).
 | |
| 	* Don't make special symbols (<s>, </s>, <unk>) member of SubVocab
 | |
| 	  by default.
 | |
| 	* Ported to gcc 2.8.1 .
 | |
| 
 | |
| 0.99	31 July 1999
 | |
| 	* Added hidden-ngram (word-boundary tagger).
 | |
| 	* Removed line length limit for File object.
 | |
| 	* Added disambig -continuous flag.
 | |
| 	* Fixed backward computation in disambig (again).
 | |
| 	* Generalized compute-best-mix to N > 2 models
 | |
| 	* Added AdaptiveMix LM class
 | |
| 	* Added nbest-mix utility (interpolation of N-best posteriors)
 | |
| 	* Added ngram -unk flag to handle open-class LMs
 | |
| 	* Added disambig and hidden-ngram -text-map option
 | |
| 	* Script enhancements:
 | |
| 	  - New script to convert nbest-lattice word graphs to PFSG
 | |
| 	    (wlat-to-pfsg)
 | |
| 	  - Added switches to include probabilities in wlat-to-dot and pfsg-to-dot
 | |
| 	    output.
 | |
| 	  - Conversion to/from AT&T FSM format: fsm-to-pfsg and pfsg-to-fsm
 | |
| 	* ngram -rescore and associated scripts no longer set a hyp 
 | |
| 	   probability to zero if it contains OOVs. Instead, the probability
 | |
| 	   is computed ignoring those words (more useful in practice).
 | |
| 	   A warning is output as always.
 | |
| 	* Added ngram-count -float-counts option.
 | |
| 	* Added build support for Linux/i686 platform.
 | |
| 
 | |
| 1.00 	8 June 2000
 | |
| 	* Added ClassNgram class and ngram -classes option.
 | |
| 	* Capability to convert class ngrams into word ngrams.
 | |
| 	* New program ngram-class for automatic word class induction.
 | |
| 	* Fixed interaction of ngram -mix-lm -bayes with non-standard n-grams:
 | |
| 	  can now build an interpolation of the non-standard (hidden-event,
 | |
| 	  class-based, etc.) n-gram with the additional, standard n-grams.
 | |
| 	* Replaced LM.noiseTag with LM.noiseVocab (list of noise tags to
 | |
| 	  be ignored).  Tools now take -noise-vocab option (as well as -noise
 | |
| 	  for backward compatibility).
 | |
| 	* Made ngram -counts work for non-n-gram models.
 | |
| 	* Added nbest-lattice -posterior-{amw,lmw,wtw} options to compute
 | |
| 	  word posteriors with different weightings from the one used in
 | |
| 	  hypothesis ranking.  Also added -deletion-bias flag for explicit
 | |
| 	  control of del/ins errors (-use-mesh mode only).
 | |
| 	* NBest rescoring methods now have optional acoustic model weight
 | |
| 	  (defaulting to 1 as before).
 | |
| 	* New class RefList (list of reference transcripts).
 | |
| 	* New class NBestSet (set of N-Best lists).
 | |
| 	* NBest, NBestSet, and nbest-lattice optionally split multiwords into
 | |
| 	  their components on reading (-multiwords option).
 | |
| 	* New nbest-optimize tool for finding near-optimal score combination
 | |
| 	  weights for word error minimizing N-best rescoring.
 | |
| 	* New anti-ngram program, for computing posterior-weighted N-gram
 | |
| 	  counts from N-best lists.
 | |
| 	* New nbest-rover script allows ROVER-style combination of hypotheses
 | |
| 	  from multiple N-best lists.
 | |
| 	* New rescore-decipher -norescore option, to reformat N-best lists
 | |
| 	  without LM rescoring.
 | |
| 	* Fixed bugs related to missing <s> and </s> in change-lm-vocab and
 | |
| 	  make-ngram-pfsg.
 | |
| 	* Significant speedups in LMs involving dynamic programming
 | |
| 	  (HiddenNgram, DFNgram, HMMofNgrams) when interpolating with other 
 | |
| 	  models or running in "ngram -debug 2" mode.
 | |
| 	* Allow absolute discounting on fractional counts, for more
 | |
| 	  effective construction of models from fractional counts.
 | |
| 	* Added ngram-merge -float-counts option, and allow "-" (stdin) as
 | |
| 	  input file.
 | |
| 	* ngram-count ensures <s> unigram (with prob 0) is defined to avoid
 | |
| 	  breaking other programs.
 | |
| 	* Added make-abs-discount script to compute absolute discounting 
 | |
| 	  constants from Good-Turing statistics.
 | |
| 	* compute-sclite and compare-sclite now take -multiwords option to
 | |
| 	  split compound words prior to scoring.
 | |
| 	* Changed option handling so that unsigned option arguments are forced
 | |
| 	  to be non-negative.
 | |
| 	* Added Map2 (2D Map) class to libdstruct.
 | |
| 	* Much better string hash function (borrowed from Tcl).
 | |
| 	* New man pages: training-scripts(1), lm-scripts(1), ppl-scripts(1),
 | |
| 	  pfsg-scripts(1), nbest-scripts(1), lm-format(5), classes-format(5),
 | |
| 	  pfsg-format(5), nbest-format(5).
 | |
| 
 | |
| 1.0.1	12 July 2000
 | |
| 
 | |
| 	Functionality:
 | |
| 
 | |
| 	* wordError() and nbest-lattice -dump-errors now also output the
 | |
| 	  location of deletions in the alignment (NOTE: possible code
 | |
| 	  incompatibility).
 | |
| 	* New reverse-ngram-counts script.
 | |
| 
 | |
| 	Bug fixes:
 | |
| 
 | |
| 	* Workarounds for shortcomings in Linux gcc, math library, and linker.
 | |
| 	* make-ngram-pfsg: don't ignore bigram states with zero BOW (bugfix).
 | |
| 	* nbest-rover: fixed problem with handling of + lines.
 | |
| 
 | |
| 1.1	21 May 2001
 | |
| 
 | |
| 	Functionality:
 | |
| 
 | |
| 	* HiddenNgram class generalized to deal with disfluency-type events
 | |
| 	  that manipulate the N-gram context.
 | |
| 	* rescore-reweight script now accepts additional score directories
 | |
| 	  (and associated score weights) for combination of an arbitrary number
 | |
| 	  of knowledge sources.
 | |
| 	* Enhanced rescore-decipher functionality:
 | |
| 	  - Option -lm-only to produce output containing LM scores only
 | |
| 	  - Option -pretty to perform word mapping on the fly.
 | |
| 	  - Warn about and handle LM scores that are NaN.
 | |
| 	* New class VocabMultiMap, implementing dictionary-style mappings of
 | |
| 	  words to strings from another vocabulary.
 | |
| 	* Added support for pronunciation-based word alignments in
 | |
| 	  WordMesh and nbest-lattice -use-mesh .
 | |
| 	* Added nbest-lattice -keep-noise option to preserve pauses and noises
 | |
| 	  in alignments.
 | |
| 	* Support for multiwords:
 | |
| 	  - make-multiword-pfsg expands PFSGs to use multiwords (using AT&T
 | |
| 	    FSM tools).
 | |
| 	  - multi-ngram expands N-gram LM to include multiwords.
 | |
| 	* Added support for Decipher Intlog scaled log probabilities.
 | |
| 	* Added ngram -seed option to initialize random sentence generation
 | |
| 	  (contributed by Eric Fosler).
 | |
| 	* New add-pauses-to-pfsg pause= and version= options to allow
 | |
| 	  generation of Nuance-compatible PFSGs (see man page for details).
 | |
| 	* The NBest class and scripts handle NBestList2.0 format containing
 | |
| 	  phone and/or state backtraces (by ignoring them).
 | |
| 	* Added Amoeba search option to nbest-optimize (contributed by
 | |
| 	  Dimitra Vergyri).
 | |
| 	* Added standard 1-best optimization mode to nbest-optimize.
 | |
| 	* wlat-to-pfsg script now also processes confusion networks output by
 | |
| 	  nbest-lattice -use-mesh .
 | |
| 
 | |
| 	Bug fixes:
 | |
| 
 | |
| 	* ngram -decipher-nobackoff now applies to the -lm ngram as well if
 | |
| 	  option -decipher is also specified.
 | |
| 	* ngram -expand-classes no longer dumps core when handling
 | |
| 	  "context-free" class expansions (though those aren't supported).
 | |
| 	* gawk path in scripts is now adjusted prior to installation
 | |
| 	  (/usr/bin/gawk for Linux, /usr/local/bin/gawk elsewhere).
 | |
| 	* Fixed numerical problems in nbest-rover/nbest-posteriors.
 | |
| 	* ngram-count -float-counts behaved differently from equivalent
 | |
| 	  integer-count estimation;  both integer and float counts now use
 | |
| 	  the same estimation code.
 | |
| 	* Reduced memory requirements of nbest-optimize by about 25%.
 | |
| 	* Minor changes for gcc-2.95.3.
 | |
| 
 | |
| 1.1.1	20 July 2001
 | |
| 
 | |
| 	Functionality:
 | |
| 
 | |
| 	* WordMesh: new interface to record reference word string in alignment.
 | |
| 	* nbest-lattice: confusion networks can now record reference words
 | |
| 	  if specified with -reference, and are preserved by -write/-read.
 | |
| 	* replace-words-with-classes now has option to process ngram count
 | |
| 	  files (have_counts=1).
 | |
| 	* merge-nbest: new utility to merge N-best hyps from multiple lists.
 | |
| 	* wlat-stats: new utility to compute statistics of word posterior
 | |
| 	  lattices.
 | |
| 
 | |
| 	Bug fixes:
 | |
| 
 | |
| 	* GT discounting: fixed anomaly due to different floating point
 | |
| 	  precision on x86 platforms.
 | |
| 	* anti-ngram(1): documented options previously omitted.
 | |
| 	* WordMesh: reading/writing of confusion networks now preserves 
 | |
| 	  total posterior mass.
 | |
| 	* Changed the hypothesis alignment order in nbest-optimize to be
 | |
| 	  more compatible with decoding in nbest-lattice: first align nbest
 | |
| 	  hyps in order of decreasing (initial) scores, then align reference.
 | |
| 	  nbest-optimize -no-reorder keeps the old behavior (with references
 | |
| 	  anchoring the alignment).  All scores and initial lambdas are now
 | |
| 	  used to compute initial posterior hyp probabilities to guide the
 | |
| 	  hypothesis alignment; thus, it now makes sense to restart an
 | |
| 	  optimization with partially optimized weights to revise the
 | |
| 	  alignments.
 | |
| 	* nbest-optimize now warns about missing or incomplete score files.
 | |
| 	* Fixed a memory access error in nbest-optimize -1best.
 | |
| 	* Fixed weight normalization in nbest-optimize when first element is 0.
 | |
| 	* Miscellaneous fixes for compile under RH Linux 7.0.
 | |
| 
 | |
| 1.2	20 November 2001
 | |
| 	
 | |
| 	Functionality:
 | |
| 
 | |
| 	* nbest-lattice -dictionary allows word alignments to be guided by
 | |
| 	  dictionary pronunciations.
 | |
| 	* nbest-lattice -use-mesh -record-hyps records the rank of N-best hyps
 | |
| 	  contributing to each word hypothesis in the confusion network.
 | |
| 	* nbest-lattice -no-rescore and -decipher-format options make it
 | |
| 	  more convenient as an N-best format conversion tool.
 | |
| 	* VocabDistance: new class and subclasses to represent distance metrics
 | |
| 	  (e.g., phonetic distance) over vocabularies.
 | |
| 	* WordMesh: output word hyps in order of decreasing posteriors.
 | |
| 	* WordMesh: reading/writing of confusion networks now includes hyp IDs
 | |
| 	  from alignment.
 | |
| 	* NBest/MultiAlign/WordMesh: support for keeping extra word-level 
 | |
| 	  information (NBestWordInfo).
 | |
| 	* nbest-lattice: unified single and multiple file processing.
 | |
| 	  New option -write-dir to write multiple output lattices.
 | |
| 	  New option -refs to supply multiple references.
 | |
| 	  Options -nbest-errors and -lattice-errors are replaced by 
 | |
| 	  switches -nbest-error/-lattice-error, in conjunction with
 | |
| 	  -references/-refs.  Outputs are now prefixed by utterance IDs
 | |
| 	  when processing multiple files.
 | |
| 	* nbest-lattice -nbest-backtrace enables processing of backtrace 
 | |
| 	  information from N-best lists; combined with -use-mesh this produces
 | |
| 	  sausages that contain word-level scores and alignment information,
 | |
| 	  as well as phone backtraces (see new wlat-format(5) man page).
 | |
| 	* wlat-stats script now also computes error statistics when processing
 | |
| 	  confusion networks with references.
 | |
| 	* nbest-rover now handles N-best lists in Decipher format.
 | |
| 	* hidden-ngram and disambig: new option -fw-only to use only forward
 | |
| 	  probabilities for posterior computation.
 | |
| 	* rescore-decipher -filter option to apply textual rewriting filters
 | |
| 	  to hypotheses before rescoring.
 | |
| 	* segment-nbest -write-nbest-dir option for dumping rescored N-best
 | |
| 	  lists to a directory instead of to stdout.
 | |
| 	* segment-nbest -start-tag and -end-tag options to insert tags at
 | |
| 	  margins of N-best hyps.
 | |
| 
 | |
| 	Bug fixes:
 | |
| 		
 | |
| 	* WordMesh: computation of deletion costs using a dictionary distance
 | |
| 	  was completely bogus (only affected undocumented nbest-lattice
 | |
| 	  -dictionary option).
 | |
| 	* nbest-lattice: correctly process -nbest-files using -dictionary in
 | |
| 	  alignment.
 | |
| 	* nbest-rover: fixed to work on Linux 
 | |
| 	* hidden-ngram: don't abort when an event posterior is 0.
 | |
| 	* hidden-ngram: avoid abort when *noevent* occurs in -hidden-vocab list.
 | |
| 	* segment-nbest: now correctly uses ngram contexts longer than trigram.
 | |
| 	* segment-nbest: optimized -bias 0 case by disallowing sentence
 | |
| 	  boundary states altogether.
 | |
| 	* multi-ngram -prune-unseen-ngrams prevents insertion of multiword
 | |
| 	  N-grams whose component N-grams were not in the original model.
 | |
| 	* ngram: fixed computation of mixture lambda for second LM when three
 | |
| 	  or more models are interpolated.
 | |
| 	* nbest-posterior (and thus nbest-rover) no longer split multiwords by
 | |
| 	  themselves.  To split multiwords with nbest-rover, append the
 | |
| 	  -multiwords option to the argument list, which is passed on to
 | |
| 	  nbest-lattice to achieve the desired effect.
 | |
| 	* ngram -renorm now applies BEFORE class expansion or pruning of 
 | |
| 	  model (in case input model is unnormalized).
 | |
| 	* make-nbest-pfsg bug involving transition into final node fixed.
 | |
| 	* Minor script changes to avoid warnings with gawk 3.1.0.
 | |
| 
 | |
| 1.3	11 February 2002
 | |
| 	
 | |
| 	Functionality:
 | |
| 
 | |
| 	* Trellis class, disambig and hidden-ngram tools: added support for
 | |
| 	  N-best decoding (contributed by Anand Venkataraman).
 | |
| 
 | |
| 	* MultiwordLM wrapper LM class as a convenient way to split multiwords
 | |
| 	  prior to LM evaluation.
 | |
| 
 | |
| 	* New MultiwordVocab class to support MultiwordLM.
 | |
| 
 | |
| 	* Added ngram -multiwords option (based on MultiwordLM wrapper).
 | |
| 
 | |
| 	* Added support for Chen & Goodman's Modified Kneser-Ney smoothing
 | |
| 	  and interpolated backoff estimates.  See ngram-count options
 | |
| 	  -kndiscount[1-6], -kn[1-6], and -interpolate[1-6].
 | |
| 
 | |
| 	* New library and tool for lattice manipulation: lattice-tool.
 | |
| 
 | |
| 	* New nbest-mix -set-am-scores and -set-lm-scores options. These allow
 | |
| 	setting either the AM or the LM scores in the N-best output to simulate
 | |
| 	the combined posteriors, while preserving the other scores.
 | |
| 
 | |
| 	* Added some regression tests (test/ subdirectory).
 | |
| 
 | |
| 	* Support for Windows via CYGWIN porting layer (MACHINE_TYPE=cygwin).
 | |
| 	See doc/README.windows for details.
 | |
| 
 | |
| 	Bug fixes:
 | |
| 
 | |
| 	* Trellis: deallocate old trellis nodes on demand in init(), rather
 | |
| 	  than preemptively in clear().  Greatly speeds up forward computation
 | |
| 	  for trellis-based LMs (e.g., ClassNgram).
 | |
| 
 | |
| 	* Textstats: fix to handle zero denominator in ppl computation.
 | |
| 
 | |
| 	* disambig: fixed off-by-one error indexing into trellis.
 | |
| 
 | |
| 	* Miscellaneous small fixes for compilation and operation under Windows
 | |
| 	(using the CYGWIN environment).
 | |
| 
 | |
| 	Warning: See doc/README.x86 about a gcc compiler bug that might 
 | |
| 	affect you on Intel platforms.
 | |
| 
 | |
| 1.3.1	25 June 2002
 | |
| 
 | |
| 	Functionality:
 | |
| 
 | |
| 	* nbest-optimize -write-rover-control option conveniently dumps a
 | |
| 	control file for nbest-rover that encodes the optimized parameters.
 | |
| 	* New regression tests for nbest-rover (i.e., nbest-lattice) and
 | |
| 	nbest-optimize.
 | |
| 	* nbest-posteriors, combine-acoustic-scores now all handle and
 | |
| 	preserve Decipher N-best formats.  This allows nbest-rover to
 | |
| 	generate sausages with backtrace information if input N-best lists
 | |
| 	contain it (using -nbest-backtrace option).
 | |
| 	* New tool nbest-pron-score for computing pronunciation and pause LM
 | |
| 	scores from N-best hypotheses.
 | |
| 	* Added disambig -totals option to compute total string probabilities
 | |
| 	(same as in hidden-ngram).
 | |
| 	* reverse-lm: simple filter to reverse a bigram backoff LM.
 | |
| 	* lattice-tool -collapse-same-words reduces lattices by merging all
 | |
| 	nodes with identical words (but also creates new paths in lattice).
 | |
| 	* nbest-lattice -prime-with-refs option uses reference strings
 | |
| 	to improve sausage alignment.
 | |
| 	* compute-best-sentence-mix: new script to optimize sentence-level
 | |
| 	interpolation of LMs.
 | |
| 	* nbest-lattice -lattice-files option to align multiple word lattices;
 | |
| 	currently only works with -use-mesh (sausages).
 | |
| 	* hidden-ngram now supports mixture and class N-gram LMs.
 | |
| 	* New class SimpleClassNgram, a more efficient implementation of 
 | |
| 	ClassNgram in which each word is assumed to belong to at most one
 | |
| 	class and class expansions are exactly one word long.
 | |
| 	Enabled by -simple-classes switch in ngram, lattice-tool, and 
 | |
| 	hidden-ngram.
 | |
| 	* ngram -counts now handles escaped input lines and LM state change
 | |
| 	directives embedded in the input.
 | |
| 	* New tool nbest-pron-score for scoring pronunciations and pauses in
 | |
| 	N-best hypotheses.
 | |
| 	* NgramStats::parseNgram() new function to parse N-gram counts from
 | |
| 	a character string.
 | |
| 	* LM::pplCountsFile() new function to evaluate LM on counts read from
 | |
| 	a file.
 | |
| 
 | |
| 	Bug fixes:
 | |
| 
 | |
| 	* make-ngram-pfsg is no longer limited to trigram models.
 | |
| 	* Avoid NaN values in disambig and hidden-ngram, in cases where lmw or
 | |
| 	mapw are zero and the corresponding log probabilities are -Infinity.
 | |
| 	* Avoid numerical problems in N-best posterior computation by using
 | |
| 	AddLogP() to compute normalizer.
 | |
| 	* anti-ngram no longer requires -refs argument with -all-ngrams.
 | |
| 	* Fixed bug removing noise from N-best lists with backtrace.
 | |
| 	* Code fixes for clean compiles with gcc 3.x.
 | |
| 	* nbest-rover more efficient by using a single invocation of
 | |
| 	nbest-lattice for all input N-best lists.
 | |
| 	* ClassNgram: fixed handling of words that appear as members of a class
 | |
| 	with zero probability, or have zero membership probability.
 | |
| 	* nbest-lattice -record-hyps now outputs hyp ids according to the 
 | |
| 	original N-best order, rather than the sorted one.
 | |
| 	* make-hiddens-lm now gives proper unigram probability to hidden-S tag.
 | |
| 	* Compute acoustic scores in Decipher N-best-2 format by subtracting
 | |
| 	token LM scores from total score.  This deals correctly with cases where
 | |
| 	the total scores have been adjusted by summing merged hyps, and are no
 | |
| 	longer the sum of all AC and LM word scores.
 | |
| 	* Gawk scripts that test for alphabetic or lowercase characters are
 | |
| 	more portable and handle non-ascii and multibyte characters.
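
	Note on the AddLogP() fix above: the item refers to summing
	probabilities without leaving the log domain. The following is a
	minimal sketch of that idea in Python, not SRILM's C++ code. The
	names add_log_p and log_posteriors are hypothetical, and natural
	logarithms are assumed here purely for illustration.

```python
import math

LOG_ZERO = float("-inf")  # log of probability 0


def add_log_p(x, y):
    """Return log(exp(x) + exp(y)) without leaving the log domain."""
    if x < y:
        x, y = y, x              # make x the larger operand
    if y == LOG_ZERO:
        return x                 # adding probability 0 changes nothing
    # exp(y - x) is at most 1, so it cannot overflow
    return x + math.log1p(math.exp(y - x))


def log_posteriors(log_scores):
    """Normalize unnormalized log scores into log posteriors."""
    norm = LOG_ZERO
    for s in log_scores:
        norm = add_log_p(norm, s)
    return [s - norm for s in log_scores]
```

	With scores around -1000, exponentiating each score directly
	underflows to 0 and the normalizer is lost, whereas the log-domain
	sum only ever exponentiates differences between scores.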
 | |
| 
 | |
| 	The package now includes a paper on SRILM, to appear in ICSLP-2002,
 | |
| 	that gives an overview of the software and its design (doc/paper.ps).
 | |
| 
 | |
| 1.3.2	3 September 2002
 | |
| 
 | |
| 	New functionality:
 | |
| 	
 | |
| 	* Added ngram and ngram-count -nonevents option to specify a
 | |
| 	subset of words that are to be non-events, i.e., tokens that can only
 | |
| 	occur in contexts (such as <s>).
 | |
| 	* Extended ngram-count discounting options for up to 9-grams.
 | |
| 	* Added support in Vocab and Ngram classes for processing meta-counts
 | |
| 	(counts-of-counts).
 | |
| 	* Added ngram-count -meta-tag and -kn-counts-modified options to
 | |
| 	support make-big-lm.
 | |
| 	* Added ngram-count -read-with-mincounts flag to suppress counts
 | |
| 	below cutoff thresholds at reading time.  This dramatically lowers
 | |
| 	memory consumption, and speeds up make-big-lm operation (which used 
 | |
| 	to use a gawk script for the same purpose).
 | |
| 	* Added option to specify vocabulary to add-pauses-to-pfsg for cases
 | |
| 	where heuristics fail.
 | |
| 	* lattice-tool can now handle arbitrary order LMs for expanding 
 | |
| 	lattices.  The old trigram expansion algorithm is still available 
 | |
| 	with -old-expansion; the compact trigram algorithm is unchanged with
 | |
| 	-compact-expansion.
 | |
| 	* To better support lattice expansion, two new functions have been
 | |
| 	added to the LM interface: contextID() takes an optional word
 | |
| 	argument, to compute the context needed to predict a specific word,
 | |
| 	and contextBOW() is a new interface to compute the backoff weight
 | |
| 	associated with truncating a history.
 | |
| 	* Added makefile support to generate executable versions that use
 | |
| 	"compact" data structures.  See item 9 in INSTALL for details, and
 | |
| 	doc/time-space-tradeoff for a simple benchmark result.
 | |
| 
 | |
| 	Bug fixes:
 | |
| 
 | |
| 	* Convert pseudo-log(0) value (-99) in DARPA backoff models back to
 | |
| 	true log(0) on reading.  This ensures that non-event words in the
 | |
| 	input are treated as zeroprobs (by the perplexity computation and
 | |
| 	otherwise).
 | |
| 	* Avoid NaN floating point results in N-best rescoring and
 | |
| 	nbest-optimize, by handling 0 * log(0) more carefully.
 | |
| 	* Handle -Inf AM and LM scores in SRILM N-best format.
 | |
| 	* make-big-lm was reworked to support KN in addition to GT discounting.
 | |
| 	Warning: the modified lower-order counts for KN are created using
 | |
| 	merge-batch-counts and can get almost as big as the original counts.
 | |
| 	Beware of the additional disk space and run time requirement!
 | |
| 	* Clear out old parameters before reading or estimating N-gram models.
 | |
| 	* Reading in new class definitions into ClassNgram object now deletes
 | |
| 	old definitions (unless classes file is empty).
 | |
| 	* Destructors for Ngram and ClassNgram now free N-gram and class 
 | |
| 	definition memory.
 | |
| 	* nbest-pron-score: avoid core dump when pronunciation information is
 | |
| 	missing from N-best list.
 | |
| 	* make-ngram-pfsg: fixed generation of unigram PFSGs.
 | |
| 	* Avoid use of toupper() in add-pauses-to-pfsg.
 | |
| 	* Handle ngram-count -order 0 and print warning.
 | |
| 	* Avoid using zcat in scripts since it behaves differently on different
 | |
| 	systems and depending on PATH setting.
 | |
| 	* nbest-lattice and nbest-optimize no longer strip a filename part
 | |
| 	following '.' to derive utterance ids; only known file suffixes
 | |
| 	are removed.
 | |
| 	* Fixed bugs in member declarations that were preventing TaggedVocab,
 | |
| 	TaggedNgramStats, and StopNgramStats from working correctly.
 | |
| 	* compute-sclite now ignores utterances with a reference of 
 | |
| 	"ignore_time_segment_in_scoring", consistent with NIST STM scoring.
 | |
| 	* Vocab.h now defines SArray_compareKey() for strings over VocabIndex,
 | |
| 	allowing use as keys in sorted arrays.
 | |
| 	* ClassNgram now uses the processed words as the context after an OOV.
 | |
| 	This works better when the input contains context cue tags.
 | |
| 	* i386-solaris platform was not being detected by machine-type script.
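
	Note on the log(0) fixes above: two items in this list concern
	log(0) handling, namely mapping the -99 placeholder back to a true
	log(0) on reading, and avoiding NaN from 0 * log(0). The following
	is a rough Python sketch of both ideas, not SRILM's C++ code. The
	function names are hypothetical, and treating 0 * log(0) as 0 is an
	assumed convention used here for illustration.

```python
import math

LOG_ZERO = float("-inf")   # true log(0)
PSEUDO_LOG_ZERO = -99.0    # placeholder value found in backoff model files


def read_log_prob(field):
    """Parse a log probability field, mapping the -99 placeholder to log(0)."""
    value = float(field)
    return LOG_ZERO if value == PSEUDO_LOG_ZERO else value


def weighted(weight, log_score):
    """Return weight * log_score, treating 0 * log(0) as 0 instead of NaN."""
    if weight == 0.0:
        return 0.0
    return weight * log_score


def combine(weights, log_scores):
    """Weighted sum of log scores, as used when combining knowledge sources."""
    return sum(weighted(w, s) for w, s in zip(weights, log_scores))
```

	In IEEE floating point, 0.0 * -inf evaluates to NaN, which would
	then contaminate every sum it enters; checking for a zero weight
	first keeps a disabled score from affecting the total.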
 | |
| 
 | |
| 1.3.3	2 March 2003
 | |
| 
 | |
| 	New functionality:
 | |
| 
 | |
| 	* Increased maximum number of interpolated LMs in ngram, hidden-ngram,
 | |
| 	and lattice-tool to 10.
 | |
| 	* ngram now computes static interpolation (N-gram merging) of up to 10
 | |
| 	input LMs (consistent with handling of dynamic interpolation).
 | |
| 	* ngram and lattice-tool -limit-vocab option limits LM reading to
 | |
| 	those parameters that pertain to words specified by -vocab.
 | |
| 	The LM::read() function got an optional second argument for this
 | |
| 	purpose.  
 | |
| 	ngram -limit-vocab -renorm now effectively does the same as the 
 | |
| 	change-lm-vocab script.  However, the main purpose of -limit-vocab
 | |
| 	is to save memory by discarding N-grams that are not relevant to a 
 | |
| 	test set.
 | |
| 	* rescore-decipher -limit-vocab precomputes the vocabulary used by
 | |
| 	N-best lists and invokes ngram -limit-vocab to allow rescoring with 
 | |
| 	very large models on machines with little memory.
 | |
| 	* Ngram::mixProbs() now has version that destructively merges an Ngram
 | |
| 	into an existing model.  ngram -mix-lm now uses this version, instead
 | |
| 	of the old, non-destructive one, thereby achieving considerable time
 | |
| 	and space savings (only two models, rather than 3, have to be kept in
 | |
| 	memory at a time).
 | |
| 	* ngram-count and ngram -map-unk option, to change the "unknown" word
 | |
| 	token string.
 | |
| 	* compute-sclite, compare-sclite now understand multiple -S options to
 | |
| 	specify intersections of several utterance subsets for scoring.
 | |
| 	* make-batch-counts now ignores lines in input file list that start 
 | |
| 	with # (allowing comments in the file list).
 | |
| 	* Added replace-words-with-classes partial=1 option to prevent 
 | |
| 	multi-word replacements that include multiple whitespace characters
 | |
| 	(i.e., "a b" is only replaced with a single space between the words).
 | |
| 	* New LM script: sort-lm, reorders N-grams lexicographically, as 
 | |
| 	required by some other software (e.g., Sphinx3, pointed out by 
 | |
| 	Mikko Kurimo <mikkok@james.hut.fi>).
 | |
| 	* New training script: reverse-text, reverses word order in text file.
 | |
| 	* New pfsg script: pfsg-vocab, extracts vocabulary used in PFSGs.
 | |
| 
 | |
| 	Bug fixes:
 | |
| 
 | |
| 	* disambig and hidden-ngram -keep-unk now also causes LM to be
 | |
| 	treated as open-vocabulary.
 | |
| 	* HiddenNgram class (debug level 2) was omitting the event after
 | |
| 	the last word from the Viterbi backtrace.
 | |
| 	* ngram -expand-classes was including -pau- word in expanded LM.
 | |
| 	* Made backoff computation in Ngram::wordProbBO() more efficient,
 | |
| 	avoiding multiple lookups in the context trie.  Gives about a 30%
 | |
| 	speedup in ngram -debug 3 -ppl.
 | |
| 	* ngram -lm reading is faster by about 8% due to a code optimization.
 | |
| 	* ngram-count -order 2 -kndiscount3 no longer aborts with an error.  
 | |
| 	The -order option effectively limits the discounting parameters
 | |
| 	computed, so that the model order can be changed without having to
 | |
| 	adjust the smoothing options.
 | |
| 	* make-big-lm -trust-totals option is ignored with KN discounting,
 | |
| 	since the two don't work well together.
 | |
| 	* make-big-lm now checks that input counts files are not stdin.
 | |
| 	* Reading N-best lists in Decipher format now sets the number-of-words
 | |
| 	score, so that weight rescoring, optimization etc. can use them.
 | |
| 	* ngram-count normalizes the N-gram probabilities for a context to 1
 | |
| 	if the backoff distribution for that context has probability mass 0.
 | |
| 	The latter can happen e.g. if all N-grams for a context have been
 | |
| 	observed and received discounted probabilities.  The fix ensures that
 | |
| 	the overall distribution is normalized in this case.
 | |
| 	* rescore-reweight now accepts Decipher N-best lists.
 | |
| 	* nbest-posteriors and nbest-rover now handle Decipher version 2
 | |
| 	N-best lists better (allowing LM and WT weights to be applied).
 | |
| 	* Initialize locale in all top-level programs.  disambig, hidden-ngram,
 | |
| 	segment, and segment-nbest were missing it, causing potential problems
 | |
| 	with non-ASCII characters.
 | |
| 	* nbest-lattice -write-vocab option to find vocabulary used in N-best
 | |
| 	list.
 | |
| 	* nbest-pron-score now uses idFromFilename() function to avoid 
 | |
| 	over-truncating filenames when inferring sentence ids.
 | |
| 	* Added more strippable filename suffixes in idFromFilename() function.
 | |
| 	* NBest: correctly read in phone backtraces that are time-reversed.
 | |
| 	* compute-oov-rate ignores -pau- tokens.
 | |
| 	* Various N-best scripts now process input directories containing links
 | |
| 	(rather than plain files) correctly.
 | |
| 	* Lattice class takes care to limit range of intlog transition
 | |
| 	probabilities in PFSG output, so as to avoid overflow when converting
 | |
| 	to bytelog scale.
 | |
| 	* make-ngram-pfsg removes temporary file (now placed in /tmp) even
 | |
| 	when killed by signal.
 | |
| 	* Hidden-event and DF N-gram models are documented in detail in ngram
 | |
| 	man page.
 | |
| 	* Test suite result comparisons against reference output now use a 
 | |
| 	script that ignores small numerical discrepancies, so as to produce 
 | |
| 	fewer false alarms.
 | |
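The ngram-count normalization fix listed above can be pictured with a small sketch. This is illustrative Python, not SRILM code: the function name normalize_context, its arguments, and the eps threshold are hypothetical, and the model is reduced to a single context whose backoff mass has already been computed.

```python
def normalize_context(probs, backoff_mass, eps=1e-12):
    """Return the probabilities for one context, renormalized if backoff is impossible.

    probs        -- dict word -> discounted p(word | context) for observed N-grams
    backoff_mass -- lower-order probability mass of words NOT observed in this context
    """
    if backoff_mass > eps:
        # The mass left over by discounting can still be handed to the
        # backoff distribution, so the explicit probabilities are kept.
        return dict(probs)
    total = sum(probs.values())
    if total <= 0.0:
        # Nothing to rescale (hypothetical guard for an empty context).
        return dict(probs)
    # No unseen word is left to back off to: rescale so the context sums to 1.
    return {word: p / total for word, p in probs.items()}
```

With backoff_mass equal to 0, two observed words at 0.4 each come back as 0.5 each. With a positive backoff mass the input is returned unchanged.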
| 
 | |
| 	Portability:
 | |
| 
 | |
| 	* Compiles under MacOS X (MACHINE_TYPE=macosx), thanks to help from
 | |
| 	wooters@icsi.berkeley.edu and jean-philippe.demoulin@enst.fr.
 | |
| 
 | |
| 1.4	14 February 2004
 | |
| 
 | |
| 	New functionality:
 | |
| 
 | |
| 	* Added support for factored language models, developed by Katrin
 | |
| 	Kirchhoff and Jeff Bilmes, and implemented by Jeff Bilmes.
 | |
| 	A new library, libflm.a, and two new tools, fngram-count and fngram
 | |
| 	are built in the flm/ directory.  A conference paper and a technical
 | |
| 	report are included as documentation in flm/doc/.  Questions and bug
 | |
| 	reports should be directed to bilmes@ee.washington.edu.
 | |
| 	FLM support has also been integrated into some of the standard 
 | |
| 	tools (ngram and hidden-ngram) and is enabled by the -factored option.
 | |
| 
 | |
| 	* Added support in lattice-tool to read/write and rescore HTK lattices.
 | |
| 	See lattice-tool man page for details.
 | |
| 	* The lattice expansion algorithm for general LMs now preserves
 | |
| 	pause and null nodes.  Consequently, lattice-tool no longer eliminates
 | |
| 	pause and null nodes prior to applying this algorithm, unless 
 | |
| 	-no-pause or -compact-pause was specified. 
 | |
| 	* Implemented a new algorithm to build word meshes (confusion networks,
 | |
| 	sausages) from lattices, that is faster than the original Mangu et al.
 | |
| 	method.  lattice-tool -posterior-decode uses this to extract 1-best
 | |
| 	word hypotheses, and lattice-tool -write-mesh allows writing of
 | |
| 	sausages to file.
 | |
| 	* The "compact" lattice expansion algorithm that uses backoff nodes
 | |
| 	(described in Weng et al. 1998) has been generalized to handle
 | |
| 	LMs of arbitrary order.  As before, this algorithm is triggered by	
 | |
| 	lattice-tool -compact-expansion.  (To get the old version, which
 | |
| 	handles only trigrams and produces non-identical results, use
 | |
| 	lattice-tool -compact-expansion -old-expansion.)
 | |
| 	* lattice-tool -density allows pruning of lattices to a specified 
 | |
| 	density (in addition to the posterior threshold).
 | |
| 	* lattice-tool -multi-char option allows designating characters other
 | |
| 	than underscore as multiword delimiters.
 | |
| 	* Added a "LatticeLM" class that emulates a language model using the
 | |
| 	transition probabilities in a lattice.  This is useful for debugging
 | |
| 	and comparing the probabilities assigned by lattices to corresponding
 | |
| 	LM probabilities.  A new option lattice-tool -ppl makes use of this
 | |
| 	class (analogous to ngram -ppl).
 | |
| 	* lattice-tool lattice algebra operations (or, concatenate) can now
 | |
| 	be applied to multiple input lattices, always using the same lattice
 | |
| 	as second operand.
 | |
| 
 | |
| 	* ngram has enhanced N-best rescoring functionality, allowing 
 | |
| 	multiple input lists to be rescored (-nbest-files, -write-nbest-dir,
 | |
| 	-decipher-nbest, -no-reorder, -split-multiwords).
 | |
| 	* rescore-decipher -fast enables a faster rescoring mode that uses 
 | |
| 	only the built-in functions of ngram, thus running much faster.
 | |
| 	* New option ngram -rescore-ngram to recompute the probabilities in
 | |
| 	an N-gram model using an arbitrary other LM.
 | |
| 
 | |
| 	* Added original (unmodified) Kneser-Ney discounting (ngram-count
 | |
| 	-ukndiscountN options). Contributed by Jeff Bilmes.
 | |
| 	* New disambig -classes option to read vocabulary maps in
 | |
| 	classes-format(5).
 | |
| 	* New disambig -write-counts option to output word/class substitution
 | |
| 	bigram counts (useful to reestimate class membership probabilities).
 | |
| 	* nbest-pron-score -pause-score-weight creates weighted combination
 | |
| 	of pronunciation and pause LM scores.
 | |
| 	* compute-sclite -noperiods option to delete periods from hyps
 | |
| 	for scoring purposes.
 | |
| 	* New script empty-sentence-lm to modify existing LM to allow
 | |
| 	the empty sentence with a given probability.
 | |
| 	* compute-sclite handles CTM files in RT-03 format.
 | |
| 	* ngram-class -debug 2 prints the initial word-to-class assignments,
 | |
| 	so that the entire class tree can be reconstructed from the output.
 | |
| 	* RefList class has option to read and look up reference words without
 | |
| 	associated ID strings (indexed by integers).
 | |
| 	* Enhanced WordMesh and WordLattice classes to have an optional
 | |
| 	"name" field, used to record utterance ids. 
 | |
| 	* New select-vocab command to implement likelihood-optimizing
 | |
| 	vocabulary selection from multiple corpora.  Contributed by 
 | |
| 	Anand Venkataraman and Wen Wang. See man page for details.
 | |
| 	
 | |
| 	Bug fixes:
 | |
| 
 | |
| 	* ngram avoids reading classes file multiple times if -limit-vocab
 | |
| 	is not being used (otherwise it is unavoidable, and will lead to
 | |
| 	errors if the reading is from stdin).
 | |
| 	* Fixed some bugs in compare-sclite and compute-sclite.
 | |
| 	* Modified ngram and compute-best-mix so that the latter works
 | |
| 	with ngram -counts output.  ngram -counts now outputs the count
 | |
| 	values != 1 for each N-gram so that compute-best-mix can take them
 | |
| 	into account in the optimization.
 | |
| 	* rescore-reweight and nbest-rover were not handling Decipher N-best
 | |
| 	lists correctly when additional score directories are given.
 | |
| 	* nbest-rover -wer disables use of nbest-lattice -use-mesh option,
 | |
| 	so nbest-rover can be used for old-style word error minimization
 | |
| 	(or even 1-best rescoring, by also specifying -max-rescore 1).
 | |
| 	* lattice-tool -ref-file and -ref-list were being ignored when
 | |
| 	processing only a single input lattice.  Fixed so that lattice error
 | |
| 	can now be computed with either -input-lattice or -input-lattice-list.
 | |
| 	* Enhanced MultiwordLM class with new contextID() and contextBOW()
 | |
| 	versions that better reflect the backoff behavior of the wrapped LM
 | |
| 	class.  Makes it much more efficient to use the lattice-tool -multiword
 | |
| 	option, i.e., expand a multiword lattice with a non-multiword LM.
 | |
| 	* rescore-decipher -pretty had a bug that caused mapping to be applied
 | |
| 	to the score fields as well, potentially corrupting the format.
 | |
| 	* Fixed bugs in mixture lambda computation (ngram, hidden-ngram,
 | |
| 	lattice-tool), triggered by more than one lambda being zero, or using
 | |
| 	more than 5 mixtures.
 | |
| 	* lattice-tool algebra operations used to crash if operand lattices
 | |
| 	contained NULL nodes.
 | |
| 	* Non-compressed files ending in .gz can now be read successfully.
 | |
| 	* Catch a possible 0/0 problem in the Good-Turing discount estimator.
 | |
| 	* Fixed memory management for strings returned by TaggedVocab::getWord()
 | |
| 	thereby avoiding garbled results.
 | |
| 	* lattice-tool -pre-reduce-iterate and -post-reduce-iterate arguments
 | |
| 	were not being used to control number of lattice reduction iterations.
 | |
| 	* Fixed an uninitialized memory bug that could produce random results
 | |
| 	in posterior probability computation (and hence in lattice pruning).
 | |
| 	* Fixed a bug in lattice pruning triggered by unnormalized posteriors
 | |
| 	greater than 1.
 | |
| 
 | |
| 	Portability:
 | |
| 
 | |
| 	* Fixed some problems compiling with gcc-3.2.2; eliminated compile-
 | |
| 	time warnings about division by zero in constant definitions.
 | |
| 	* Rewrote some code to work around limitations and warnings in the
 | |
| 	Intel C++ compiler.  (In return, got compiled code that runs 10-20%
 | |
| 	faster!)  For processor-specific optimizations, use
 | |
| 		make MACHINE_TYPE=i686-p4	.
 | |
| 	* Fixed some script problems that surfaced in latest gawk version.
 | |
| 	* Fixed some problems compiling with Tcl/Tk-8.4.1.
 | |
| 	* FreeBSD support (contributed by Zhang Le <ejoy@peoplemail.com.cn>).
 | |
| 	* Updated Nuance-related features in PFSG scripts and man page.
 | |
| 	* Note: Integration of FLM support required some changes to the 
 | |
| 	Vocab and Ngram class interface.  In particular, several member
 | |
| 	variables (e.g., Boolean Vocab::unkIndex) have been replaced by virtual
 | |
| 	member functions that return references to the variables (e.g.,
 | |
| 	Boolean &Vocab::unkIndex()).  This requires changes, albeit trivial ones,
 | |
| 	to any client code that accesses these variables.
 | |
| 
 | |
| 1.4.1	9 May 2004
 | |
| 
 | |
| 	Functionality:
 | |
| 
 | |
| 	* New option lattice-tool -htk-quotes to enable the HTK quoting 
 | |
| 	mechanism that allows whitespace and non-printable characters to be
 | |
| 	used in word labels.  (This is disabled by default since other SRILM
 | |
| 	tools don't allow such word strings.)
 | |
| 	* New option lattice-tool -add-refs to add a path corresponding to
 | |
| 	the reference word string to each lattice.
 | |
| 	* New option ngram -counts-entropy to compute entropy (log probabilities
 | |
| 	weighted by joint N-gram probability) from counts.
 | |
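The entropy computation described for ngram -counts-entropy (log probabilities weighted by joint N-gram probability) can be sketched as follows. This is a hypothetical Python illustration, not the SRILM implementation: counts_entropy and its input format are assumptions, and base-2 logarithms are used only so that the result reads in bits.

```python
import math

def counts_entropy(entries):
    """Entropy in bits: minus the sum of joint * log2(conditional).

    entries -- iterable of (joint_prob, cond_prob) pairs, one per N-gram,
               where joint_prob = p(history, word) and
               cond_prob = p(word | history), assumed to be > 0
    """
    entropy = 0.0
    for joint_prob, cond_prob in entries:
        if joint_prob > 0.0:
            # N-grams with zero joint probability contribute nothing.
            entropy -= joint_prob * math.log2(cond_prob)
    return entropy
```

For four equally likely events with an empty history (joint and conditional probability both 0.25), the sum is 4 * 0.25 * 2 = 2 bits.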
| 
 | |
| 	Bugs fixed:
 | |
| 
 | |
| 	* nbest-lattice could core dump if references were not supplied.
 | |
| 	* FLM/ProductVocab: fixed problems with mapping of <s> and </s> to
 | |
| 	factored form.
 | |
| 	* Lattice algebra operations (or, concatenate) now preserve HTK link
 | |
| 	information and lattice names.
 | |
| 	* Fixed LM::contextProb() handling of <s> and other non-event tokens.
 | |
| 	This also allowed Ngram::computeContextProb() to be eliminated.
 | |
| 	* LatticeFollowIter iterator no longer takes lookahead parameter --
 | |
| 	lookahead is unlimited and cycles are avoided by keeping a table of
 | |
| 	visited nodes.  This also greatly speeds up lattice expansion in
 | |
| 	some cases.
 | |
| 	* Detect negative discounts in modified Kneser-Ney method, arising 
 | |
| 	from non-monotonic counts-of-counts.
 | |
| 	* Fixed various debugging output messages in the Lattice class.
 | |
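The negative-discount condition mentioned above for modified Kneser-Ney can be reproduced with the standard Chen and Goodman estimates of the three discounts from the counts-of-counts n1..n4. The sketch below is hypothetical Python, not SRILM code; the function names are assumptions, and how SRILM reacts once a negative discount is detected is not modeled.

```python
def modified_kn_discounts(n1, n2, n3, n4):
    """Chen & Goodman style discounts D1, D2, D3+ from counts-of-counts.

    n_k is the number of distinct N-grams that occur exactly k times.
    """
    if min(n1, n2, n3) <= 0:
        raise ValueError("counts-of-counts n1..n3 must be positive")
    y = n1 / (n1 + 2.0 * n2)
    d1 = 1.0 - 2.0 * y * n2 / n1
    d2 = 2.0 - 3.0 * y * n3 / n2
    d3plus = 3.0 - 4.0 * y * n4 / n3
    return d1, d2, d3plus

def has_negative_discount(discounts):
    """True if any estimated discount is below zero."""
    return any(d < 0.0 for d in discounts)
```

With counts-of-counts that fall off monotonically (for example 1000, 400, 200, 120) all three discounts are positive. If n3 exceeds n2 by enough (for example 1000, 100, 200, 50), D2 becomes negative, which is the situation the check is meant to catch.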
| 
 | |
| 	Portability:
 | |
| 
 | |
| 	* Matthias Thomae <thomae@ei.tum.de> found that make-ngram-pfsg
 | |
| 	(and probably other gawk scripts) may not work correctly with recent
 | |
| 	versions of gawk unless the environment is set to LC_NUMERIC=C.
 | |
| 
 | |
| 1.4.2	19 October 2004
 | |
| 
 | |
| 	Functionality:
 | |
| 
 | |
| 	* lattice-tool -factored option to handle factored LMs (analogous
 | |
| 	to ngram and hidden-ngram).
 | |
| 	* lattice-tool -nbest-decode generates N-best lists from lattices
 | |
| 	(contributed by Dustin Hillard, University of Washington).
 | |
| 	* lattice-tool -output-ctm option to generate CTM-formatted 1-best
 | |
| 	output, either with -viterbi-decode or with -posterior-decode.
 | |
| 	Of course this requires HTK input lattices containing timemarks.
 | |
| 	* Added version of WordMesh::minimizeWordError() that returns acoustic
 | |
| 	information in a NBestWordInfo array, to support the above.
 | |
| 	* lattice-tool -insert-pause option to insert optional pause nodes in
 | |
| 	lattices.
 | |
| 	* lattice-tool -unk will map unknown words to <unk> instead of 
 | |
| 	automatically augmenting the vocabulary (the -map-unk option allows
 | |
| 	the mapping of unknown words to be customized).
 | |
| 	* lattice-tool -acoustic-mesh records word times, scores, and phone
 | |
| 	alignments when confusion networks are built.
 | |
| 	* lattice-tool -ignore-vocab option to define the set of words that
 | |
| 	are ignored in LM processing (like pause nodes).
 | |
| 	* lattice-tool -write-ngrams option to compute expected N-gram counts
 | |
| 	from lattices.
 | |
| 	* HTK lattices now support up to three "extra" score fields (x1..x3),
 | |
| 	which can be used to rescore hypotheses with arbitrary non-standard
 | |
| 	knowledge sources.
 | |
| 	* Added support for the "s" key in HTK lattices (used to encode
 | |
| 	state alignment info).
 | |
| 	* anti-ngram -min-count option to prune N-grams with expected frequency
 | |
| 	below specified threshold.
 | |
| 	* ngram -adapt-marginals and related options to trigger use of
 | |
| 	unigram marginals adaptation, following Kneser et al. (Eurospeech 97).
 | |
| 	* New LM class AdaptMarginals to support the above.
 | |
| 	* nbest-lattice and lattice-tool -hidden-vocab option allows specifying
 | |
| 	a subvocabulary that should not be aligned with regular words when
 | |
| 	building confusion networks.
 | |
| 	* New VocabDistance subclass SubvocabDistance, to support the above.
 | |
| 	* nbest-optimize -combine-linear and -non-negative options, useful to
 | |
| 	optimize linear combinations of posterior probability scores.
 | |
| 
 | |
| 	Bugs fixed:
 | |
| 
 | |
| 	* lattice-tool: Avoid disconnecting lattice in density pruning.
 | |
| 	* Utility script installation was not working for Cygwin hosts.
 | |
| 	* ProductNgram::contextID() now returns hash code of context used,
 | |
| 	instead of zero, and limits context-used length to order-1.
 | |
| 	* HTK lattice output was omitting wdpenalty value.
 | |
| 	* Improved collision-prone hash function for VocabIndex arrays.
 | |
| 	* Documented order of operations in lattice-tool(1).
 | |
| 	* Fixed excessive /tmp space usage in nbest-rover script, so as to
 | |
| 	avoid frequent incomplete output with large N-best data as a result
 | |
| 	of running out of disk space.
 | |
| 	* Fixed bug in compute-sclite that would garble STM references without
 | |
| 	the optional 6th field.
 | |
| 	* Fixed bug in Trie::insert(), which would always set foundP = true,
 | |
| 	even if a new entry was created.
 | |
| 	* Preserve Lattice::limitIntlogs flags in lattice algebra operations.
 | |
| 	* Use sorted node map iteration in lattice-tool expansion algorithms,
 | |
| 	so that results are not subject to pseudo-random hash table ordering.
 | |
| 	* HTK lattice output no longer has more nodes/links than input
 | |
| 	(provided -no-htk-nulls, -htk-scores-on-nodes, or -htk-words-on-nodes
 | |
| 	are NOT used).
 | |
| 	* Take default lattice name from input filename, rather than output
 | |
| 	filename (which may not be defined), however:
 | |
| 	* The embedded names of output lattices from binary lattice operations
 | |
| 	are derived from the output file name.
 | |
| 	* Fixed bug in reading of word meshes (confusion networks) introduced
 | |
| 	in release 1.4.
 | |
| 	* Fixed a bug in alignments of multiple confusion networks, affecting
 | |
| 	cases where the inputs have posterior masses != 1.
 | |
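The Trie::insert() fix noted above concerns the "found" flag, which should tell the caller whether the entry already existed. A minimal Python sketch of that contract follows. It is an illustration only: SRILM's Trie is a C++ template, and the names TrieNode and trie_insert, as well as returning the flag as a tuple element instead of via a foundP output argument, are assumptions made here.

```python
class TrieNode:
    """One trie node; children are indexed by key."""
    def __init__(self):
        self.children = {}
        self.value = None

def trie_insert(root, keys):
    """Walk (and create if needed) the path for keys.

    Returns (node, found), where found is True only if every node on
    the path already existed, and False as soon as a node is created.
    """
    node = root
    found = True
    for key in keys:
        child = node.children.get(key)
        if child is None:
            child = TrieNode()
            node.children[key] = child
            found = False          # a new entry had to be created
        node = child
    return node, found
```

Inserting the same key sequence twice yields found == False the first time and found == True the second time. The bug described above corresponds to reporting True in both cases.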
| 
 | |
| 1.4.3	3 December 2004
 | |
| 
 | |
| 	Functionality:
 | |
| 
 | |
| 	* Increased the number of extra scores supported in HTK lattices
 | |
| 	(x1, x2, ... x9).
 | |
| 	* lattice-tool -nbest-viterbi option to use Viterbi N-best algorithm,
 | |
| 	which uses less memory (contributed by Jing Zheng).  
 | |
| 	* Added nbest-lattice -output-ctm, analogous to lattice-tool.
 | |
| 	* Make -output-ctm output word posteriors in the confidence field.
 | |
| 	* Extend the meaning of the nbest-lattice -max-rescore option so that,
 | |
| 	in lattice mode, it limits the number of hypotheses that are aligned.
 | |
| 	(The meaning of -max-rescore was previously only defined in N-best
 | |
| 	rescoring mode).
 | |
| 	* Added -version option to all top-level programs.
 | |
| 
 | |
| 	Bug fixes:
 | |
| 
 | |
| 	* Improved efficiency and duplicate elimination in A-star N-best
 | |
| 	generation (contributed by Jing Zheng).
 | |
| 	* Worked around a problem with gawk scripts in Linux handling of
 | |
| 	/dev/stderr device which can cause a file to be truncated if stderr is
 | |
| 	redirected to it.
 | |
| 	* MultiAlign::addWords() was not preserving NBestWordInfo.
 | |
| 
 | |
| 	Other:
 | |
| 
 | |
| 	* Various small code changes for compilation with gcc 3.4.3.
 | |
| 	* Maintenance scripts moved to $SRILM/sbin/.
 | |
| 	* Support for commercial releases excluding third-party code
 | |
| 	contributions.
 | |
| 
 | |
| 1.4.4	6 May 2005
 | |
| 
 | |
| 	Functionality:
 | |
| 
 | |
| 	* ngram-count now allows use of -wbdiscount, -kndiscount, etc.,
 | |
| 	without a specified N-gram order, to set the default discounting
 | |
| 	method for all N-gram orders.  As before, this can be overridden by
 | |
| 	-wbdiscount[1-9], -kndiscount[1-9], etc., for specific N-gram
 | |
| 	lengths (suggested by Anand).
 | |
| 	* lattice-tool -keep-pause has additional side-effects if used with
 | |
| 	-nonevents and -ignore-vocab (making pauses behave like regular words).
 | |
| 	* lattice-tool -dictionary-align option triggers use of dictionary
 | |
| 	pronunciations for word mesh alignment (contributed by Dustin Hillard).
 | |
| 	* New option lattice-tool -nbest-duplicates allows control over the
 | |
| 	number of duplicate word hypotheses to output (from Dustin Hillard).
 | |
| 	* Update to the FLM tools from Kevin Duh, to make fngram-count use the
 | |
| 	-vocab option to limit the vocabulary of the estimated model.
 | |
| 	* Added nbest-optimize -hidden-vocab option to constrain the alignment
 | |
| 	of a subvocabulary (analogous to nbest-lattice -hidden-vocab).
 | |
| 	* wlat-stats computes the posterior expected number of words in the
 | |
| 	input lattice.
 | |
| 
 | |
| 	Bug fixes:
 | |
| 
 | |
| 	* ngram -unk maps unknown words in N-best hyps to <unk> instead of
 | |
| 	adding them to the vocabulary.
 | |
| 	* lattice-tool: Don't punt when encountering a NULL word node with
 | |
| 	pronunciation, output a warning instead.
 | |
| 	* lattice-tool -nbest-decode now uses a double-ended heap data
 | |
| 	structure, and -nbest-max-stack drops hypotheses from the bottom
 | |
| 	of the heap instead of the top (contributed by Dustin Hillard).
 | |
| 	* lattice-tool -nbest-decode now does more thorough duplicate removal
 | |
| 	(not just adjacent duplicates are removed).
 | |
| 	* lattice-tool no longer gives an error if input lattice has posteriors
 | |
| 	specified on nodes (even though they are effectively ignored).
 | |
| 	* select-vocab: miscellaneous bug fixes from Anand.
 | |
| 	* nbest-lattice: fixed various bugs with -nbest-backtrace option.
 | |
| 	* compute-sclite: work around bug in csrfilt.sh -dh affecting waveform
 | |
| 	names containing hyphens.
 | |
| 	* Minor tweaks for MacOSX build.
 | |
| 
 | |
| 1.4.5	28 August 2005
 | |
| 
 | |
| 	Functionality:
 | |
| 
 | |
| 	* ngram -debug 0 -ppl now outputs statistics for each input section
 | |
| 	delimited by escape lines, in addition to overall results (based on
 | |
| 	a modification by Dustin Hillard).  ngram -debug 1 and higher behave as
 | |
| 	before.
 | |
| 	* ngram -loglinear-mix implements log-linear mixture LMs.
 | |
| 	* LoglinearMix: new class to support the above.
 | |
| 	* VocabMap: added remove(.) method to remove all entries for given
 | |
| 	source word.
 | |
| 	* WordMesh: added wordColumn() function to return confusion set at
 | |
| 	given position (contributed by Dustin).
 | |
| 	* Lattice: added readMesh() function to read in confusion networks
 | |
| 	(from Dustin).
 | |
| 	* lattice-tool -read-mesh allows reading input in confusion network format
 | |
| 	(from Dustin).
 | |
| 	* nbest-optimize -1best-first implements a heuristic strategy whereby
 | |
| 	the relative score weights are first optimized in -1best mode, followed
 | |
| 	by full optimization together with posterior scale.
 | |
| 	* nbest-optimize -max-time forces search to time out if new best
 | |
| 	weights aren't found within a certain number of seconds.
 | |
| 	* New script combine-rover-controls to merge multiple nbest-rover
 | |
| 	control files for system combination.
 | |
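For readers unfamiliar with the log-linear mixtures introduced by ngram -loglinear-mix, the sketch below shows the usual definition: component probabilities are combined as a weighted product and then renormalized over the vocabulary. It is hypothetical Python under that assumption, not the LoglinearMix class. The function name and the dictionary representation are made up for illustration, and all components are assumed to share one vocabulary and assign nonzero probability to every word.

```python
import math

def loglinear_mix(dists, weights):
    """Combine distributions p_i(w | h) as prod_i p_i(w | h) ** lambda_i, normalized.

    dists   -- list of dicts word -> probability, all over the same words
    weights -- list of exponents lambda_i, one per distribution
    """
    scores = {}
    for word in dists[0]:
        log_score = sum(lam * math.log(dist[word])
                        for dist, lam in zip(dists, weights))
        scores[word] = math.exp(log_score)
    norm = sum(scores.values())      # normalizer Z(h) over the vocabulary
    return {word: score / norm for word, score in scores.items()}
```

Unlike linear interpolation, the weighted product does not sum to 1 by itself, which is why the explicit normalization step is needed.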
| 
 | |
| 	Bug fixes:
 | |
| 
 | |
| 	* disambig clears old map entries when encountering a duplicate
 | |
| 	definition for a source word.
 | |
| 	* nbest-optimize: posterior scaling of fixed weights was broken.
 | |
| 	* WordMesh, nbest-lattice: do better error checking on reading
 | |
| 	confusion network files, handle numalign and posterior specs out of
 | |
| 	order.
 | |
| 	* lattice-tool had a bug in the handling of HTK format lattices that
 | |
| 	do not contain an explicit specification of initial/final nodes.
 | |
| 	* Added proper copy constructors and assignment operators for
 | |
| 	Array, SArray, and LHash classes.  This in turn makes the copy
 | |
| 	constructor for NgramLM and other classes work properly.
 | |
| 	(Assignment still doesn't work for some higher-level classes because
 | |
| 	of reference (&) variable members.)
 | |
| 	* Fixed minor bug in the ngram -skipoovs implementation, found by
 | |
| 	Alexandre Patry.
 | |
| 
 | |
| 	Portability:
 | |
| 
 | |
| 	* Port to win32-mingw platform (by Jing Zheng).  Doesn't support
 | |
| 	compressed file i/o, or the -max-time options in nbest-optimize and
 | |
| 	lattice-tool.
 | |
| 	* Minor tweaks for compilation with gcc-4.0.1.
 | |
| 	* Renamed HTKLink class to HTKWordInfo, which is more appropriate and
 | |
| 	avoids a naming conflict with SRI's Decipher software.
 | |
| 
 | |
| 1.4.6	20 January 2006
 | |
| 
 | |
| 	Functionality:
 | |
| 
 | |
| 	* Added support for reading/writing files compressed with bzip2
 | |
| 	(file suffix .bz2).  Requires that the bzip2/bunzip2 binaries be
 | |
| 	installed.
 | |
| 
 | |
| 	Bug fixes:
 | |
| 
 | |
| 	* Lattice class now creates completely empty lattices (no nodes).
 | |
| 	This avoids having to first remove a node when reading an actual
 | |
| 	lattice.  Empty lattices can be output, but not read (because at
 | |
| 	least an initial/final node has to be defined).
 | |
| 	* lattice-tool -ignore-vocab was not being used in conjunction with
 | |
| 	-viterbi-decode, -posterior-decode, -collapse-same-words, and lattice
 | |
| 	error computation.  Words to be ignored are now treated the same as
 | |
| 	-noise-vocab in those operations.
 | |
| 	* Fixed a bug in lattice expansion whereby backoff weights were 
 | |
| 	dropped at NULL nodes (problem noticed by Teemu Hirsimaki).
 | |
| 	* Fixed bug in reading of node-specific posterior probabilities
 | |
| 	in word meshes.
 | |
| 	* Fixed a bug in lattice-tool -read-mesh, which was not creating
 | |
| 	sentence initial/final tags on initial/final lattice nodes.
 | |
| 	* Fixed a bug in the LatticeFollowIter class that could cause incorrect
 | |
| 	results in LatticeLM (lattice-tool -ppl).
 | |
| 	* When outputting PFSG lattices in HTK format, map PFSG weights to
 | |
| 	HTK acoustic scores.  (But, as before, LM rescoring discards input
 | |
| 	PFSG weights and causes the probabilities to be output as LM scores.)
 | |
| 	* Scale wdpenalty values specified in lattice according to log-base.
 | |
| 	Also, scale -htk-wdpenalty specified on command line according to
 | |
| 	-htk-logbase (or default 10).
 | |
| 	* Correctly handle HTK score output with -htk-logbase 0.
 | |
| 
 | |
| 	Portability:
 | |
| 
 | |
| 	* Added workaround for compilers that don't support arrays of
 | |
| 	non-constant size (such as SunStudio and Visual C++). On these 
 | |
| 	systems, Array will be used instead.
 | |
| 
 | |
| 	* Added a new compilation option "_s" that triggers use of 2-byte 
 | |
| 	integers for vocabulary indices and counts.  With compilers that
 | |
| 	implement __attribute__((packed)) correctly, this causes N-gram counts
 | |
| 	to use 1/3 less memory than in the default option, with some limitations
 | |
| 	in functionality.  First, only vocabularies of up to 64k words may
 | |
| 	be used.  Second, only up to 32k counts exceeding 32k may be stored.
 | |
| 	The latter is typically not a problem because in most natural data
 | |
| 	the number of very frequent words is small.
 | |
| 	Unfortunately, gcc does not currently handle __attribute__((packed))
 | |
| 	correctly, but Intel's icc does.
 | |
| 
 | |
| 	* Tested on Linux for PowerPC-64bit.
 | |
| 
 | |
| 	* Tested on Linux for x86_64, using gcc.
 | |
| 
 | |
| 	* Minor tweaks for Intel icc 8.0.
 | |
| 
 | |
| 	* Tested on Solaris-x86 using Sun Studio 11 compiler.
 | |
| 	Compilation still generates lots of warnings, but the resulting
 | |
| 	binaries work correctly.
 | |
| 
 | |
| 	* Ported to Microsoft Visual C 7.0 (by Jing Zheng).
 | |
| 	See doc/README.windows-msvc.
 | |
| 
 | |
| 	* gcc versions older than 3.4.3 are no longer supported, though
 | |
| 	they might still work.
 | |
| 
 | |
| 1.5.0	31 July 2006
 | |
| 
 | |
| 	Functionality:
 | |
| 
 | |
| 	* Added support for a binary data format for N-gram backoff models
 | |
| 	which speeds up the reading of model files by a factor of 2 
 | |
| 	for full models, and by an order of magnitude if -limit-vocab is used.
 | |
| 	Note that the binary format is machine architecture dependent.
 | |
| 	See the ngram -write-bin-lm option (contributed by Jing Zheng).
 | |
| 
 | |
| 	* disambig now supports Bayesian or standard interpolation of up to 
 | |
| 	10 LMs, just like ngram and hidden-ngram.
 | |
| 
 | |
| 	* Added disambig -factored option to support factored hidden tag LMs.
 | |
| 
 | |
| 	* Added disambig -escape option to pass information unprocessed to 
 | |
| 	the output, similar to hidden-ngram.
 | |
| 
 | |
| 	* New utility script: split-tagged-ngrams, see training-scripts(1)
 | |
| 	man page.
 | |
| 
 | |
| 	* New function Vocab::checkWords() for more efficient implementation
 | |
| 	of the ngram -limit-vocab functionality.
 | |
| 
 | |
| 	* Modified compute-sclite to support scoring of overlapped speech 
 | |
| 	with asclite program.
 | |
| 
 | |
| 	* New NgramCountLM class implementing a mixture of count-based 
 | |
| 	maximum-likelihood estimators (aka deleted interpolation aka
 | |
| 	Jelinek-Mercer smoothing).
 | |
| 
 | |
| 	* ngram-count and ngram -count-lm options to implement deleted 
 | |
| 	estimation and evaluation of NgramCountLM models.
 | |
| 	This option is also supported by hidden-ngram, disambig, and
 | |
| 	lattice-tool.
 | |
| 
 | |
| 	* Added support for ngram counts stored in an indexed directory
 | |
| 	structure, based on a format developed by Thorsten Brants for data
 | |
| 	delivered to LDC by Google.  This data format can be used in
 | |
| 	conjunction with the NgramCountLM class, and may be generated
 | |
| 	from standard ngram count files using the make-google-ngrams script
 | |
| 	(see training-scripts(1)).
 | |
| 
 | |
| 	* Added NgramStats::clear() function.
 | |
| 
 | |
| 	* Added the limitVocab option to the NgramStats::read() function.
 | |
| 	In conjunction with NgramCountLM, this allows use of arbitrarily 
 | |
| 	large N-gram statistics on limited test sets.
 | |
| 
 | |
| 	* Added ngram-count -limit-vocab option.
 | |
| 
 | |
| 	* Added hidden-ngram -vocab and -limit-vocab options.
 | |
| 	Possible incompatibility: the -hidden-vocab wordlist must not contain
 | |
| 	the *noevent* word; it is added implicitly.
 | |
| 
 | |
| 	* Added lattice-tool -write-vocab option to extract vocabulary from
 | |
| 	lattice files.
 | |
| 
 | |
| 	* Added lattice-tool -init-mesh option to align lattice to preexisting
 | |
| 	confusion network.
 | |
| 
 | |
| 	* Added an interface for vocabulary aliasing (name mapping) to
 | |
| 	the Vocab class, and the option -vocab-aliases to the programs
 | |
| 	disambig, hidden-ngram, lattice-tool, nbest-lattice,
 | |
| 	ngram-count, and ngram.  This allows direct use of LMs with
 | |
| 	slightly mismatched vocabularies relative to some test data.
 | |
| 	Also, added handling of the -vocab-aliases option to the
 | |
| 	rescore-decipher script, so that large name mapping files can
 | |
| 	be subsetted when -limit-vocab is in effect (so that only the
 | |
| 	relevant portions of an LM are loaded).
 | |
| 
 | |
| 	* disambig now automatically limits LM reading to the words found in
 | |
| 	the map file (suggested by Jing Zheng).
 | |
| 
 | |
| 	* hidden-ngram -bayes and -bayes-length options added to give more
 | |
| 	control over interpolation.
 | |
| 
 | |
| 	* The default count type is now "unsigned long" instead of
 | |
| 	"unsigned int".  This makes no difference on 32-bit platforms,
 | |
| 	but on 64-bit platforms it allows the handling of data upwards of
 | |
| 	4.3 billion tokens (which would cause integer overflow on 32-bit
 | |
| 	machines).
 | |
| 
 | |
| 	* For 32-bit platforms, added a compile option "_l", which triggers
 | |
| 	use of 64-bit "long long" integers for count storage.
 | |
| 	This uses the XCount class to avoid needing extra memory for count
 | |
| 	storage, assuming that large count values will be sparse.
 | |
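The NgramCountLM entry above refers to deleted interpolation (Jelinek-Mercer smoothing), in which maximum-likelihood estimates of increasing order are mixed recursively. The following is a simplified Python sketch of that idea, not the NgramCountLM implementation: the function jm_prob, the count table layout, the single weight per order, and the uniform base distribution are all assumptions made for illustration.

```python
def jm_prob(word, context, counts, lambdas, vocab_size):
    """Jelinek-Mercer estimate of p(word | context).

    counts     -- dict mapping word tuples to counts; counts[()] is the
                  total token count, and the count of a history tuple is
                  used as the denominator for that history
    lambdas    -- lambdas[k] is the weight of the ML estimate that uses
                  the last k context words
    vocab_size -- size of the vocabulary for the uniform base case
    """
    prob = 1.0 / vocab_size                      # order-0 (uniform) estimate
    for k in range(len(context) + 1):
        history = tuple(context[len(context) - k:])
        history_count = counts.get(history, 0)
        if history_count == 0:
            continue                             # unseen history: keep lower order
        ml = counts.get(history + (word,), 0) / history_count
        prob = lambdas[k] * ml + (1.0 - lambdas[k]) * prob
    return prob
```

In deleted estimation the weights would be tuned on held-out data. Here they are simply passed in.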
| 
 | |
| 	Bug fixes:
 | |
| 
 | |
| 	* Fixed a bug in the handling of -mix-lm[789] options in ngram,
 | |
| 	hidden-ngram and lattice-tool.  (With the -bayes option in effect,
 | |
| 	the -mix-lm6 argument was used for -mix-lm[789].)
 | |
| 
 | |
| 	* Fixed memory management in the XCount implementation, which was 
 | |
| 	giving incorrect results when compiling with OPTION=_s.
 | |
| 
 | |
| 	* disambig no longer adds <s> and </s> tokens if input already 
 | |
| 	contains them (consistent with ngram).
 | |
| 	
 | |
| 	* lattice-tool -read-mesh was broken in the previous release, now
 | |
| 	works again.
 | |
| 
 | |
| 	* lattice-tool -density-prune and -nodes-prune now work without
 | |
| 	-posterior-prune being specified.
 | |
| 
 | |
| 	* The -debug option was being ignored with ngram -null .
 | |
| 
 | |
| 	* Fixed a bug in Vocab::remove(VocabString) that could be triggered by
 | |
| 	interactions between ngram -vocab and -vocab-aliases .
 | |
| 
 | |
| 	* Tweaks to MACHINE_TYPE=msvc compilation.  Updated documentation in
 | |
| 	doc/README.windows-cygwin and doc/README.windows-msvc.
 | |
| 	
 | |
| 	* Tweaked compiler flags for Solaris to handle files larger than 2^31.
 | |
| 
 | |
| 	* Prevent possible NaN probabilities in ClassNgram.
 | |
| 
 | |
| 	* Fixed a problem in make-ngram-pfsg triggered by a word named "BO".
 | |
| 
 | |
| 	* Support long int key values in data structures.
 | |
| 
 | |
| 	* rescore-decipher -filter option now works correctly in conjunction
 | |
| 	with -limit-vocab.
 | |
| 
 | |
| 1.5.1	20 November 2006
 | |
| 
 | |
| 	Functionality:
 | |
| 
 | |
| 	* ngram-count -write-binary is a new option to create binary count 
 | |
| 	files, which load much faster.  They are recognized automatically by
 | |
| 	ngram-count -read, and can be used in count-based LMs.
 | |
| 
 | |
| 	* Revised binary backoff LM format (ngram -write-bin-lm) to use only
 | |
| 	a single data file and be machine-independent and somewhat more
 | |
| 	compact.  Reading the 1.5.0 binary format is still supported, but not
 | |
| 	writing it.
 | |
| 
 | |
| 	* Added lattice-tool -bayes and -bayes-scale options for compatibility
 | |
| 	with ngram and other programs.
 | |
| 
 | |
| 	* New lattice-tool -write-ngram-index option to generate an index of
 | |
| 	N-gram occurrences in a lattice.
 | |
| 
 | |
| 	* New lattice-tool -multiword-dictionary option enables accurate 
 | |
| 	handling of acoustic information (timestamps, pronunciations) when the
 | |
| 	-split-multiwords option is used (contributed by Dustin Hillard).
 | |
| 
 | |
| 	* New nbest-optimize -insertion-weight and -word-weights options to
 | |
| 	implement weighted forms of word error optimization.
 | |
| 
 | |
| 	* New option make-ngram-pfsg no_empty_bo=1 to disallow an empty (null)
 | |
| 	path through the PFSG via the unigram backoff.
 | |
| 
 | |
| 	* New script get-unigram-probs to extract unigram probabilities from
 | |
| 	an LM file.
 | |
| 
 | |
| 	Bug fixes:
 | |
| 	
 | |
| 	* Enabled large-file (64bit offsets) handling for Linux 32bit 
 | |
| 	compilation.
 | |
| 
 | |
| 	* Fixed utility and test scripts to support platforms that don't 
 | |
| 	support compressed file I/O.  Check test/README for instructions.
 | |
| 
 | |
| 	* Fixed bug in compute-sclite that could lead to failure if 
 | |
| 	waveform names contain hyphens, or sort differently after mapping to
 | |
| 	lowercase.
 | |
| 
 | |
| 	* Fixed another bug in compute-sclite that was preventing
 | |
| 	compare-sclite from working.
 | |
| 
 | |
| 	* Fixed a typo-bug in Ngram::estimate that could cause problems in
 | |
| 	handling discounting errors, but in practice seems to have been
 | |
| 	harmless (from Federico Cesari).
 | |
| 
 | |
| 	* Improved MSVC portability:
 | |
| 	  - fixed header file usage
 | |
| 	  - enabled binary file i/o for binary LMs
 | |
| 	  - fixed miscellaneous compiler warnings
 | |
| 	  - simplified build (see doc/README.windows-msvc)
 | |
| 	  - workaround in WordMesh.cc to avoid a compiler bug (from
 | |
| 	    Federico Cesari).
 | |
| 
 | |
| 	* Fixed win32 (Windows gcc, not cygwin) build.
 | |
| 
 | |
| 1.5.2	6 March 2007
 | |
| 
 | |
| 	Functionality:
 | |
| 
 | |
| 	* Support binary LM formats (based on Ngram binary format) for most
 | |
| 	LM classes.
 | |
| 
 | |
| 	* New lattice-tool -htk-logzero option to set a dummy score to 
 | |
| 	replace zero scores found in HTK lattices.
 | |
| 
 | |
| 	Bug fixes:
 | |
| 
 | |
| 	* Make sure Google ngrams can be read in both compressed and
 | |
| 	uncompressed format if platform supports both.
 | |
| 
 | |
| 	* Make sure the file pointer is updated when reading binary Ngram LM.
 | |
| 	This enables reading multiple LMs from one file, and avoids errors
 | |
| 	reading binary class-LMs.
 | |
| 
 | |
| 	* Avoid NaN values when a lattice score is infinity and the
 | |
| 	corresponding scale factor is 0 (the score is ignored in that case).
 | |
| 
 | |
| 	* Avoid degenerate decoding results if lattice hypotheses contain
 | |
| 	-infinity scores. (Effectively, -infinity is replaced by a large
 | |
| 	negative log score, thus allowing the decoder to rank hypotheses based
 | |
| 	on their non-infinity components.)
 | |
| 
 | |
| 	* Updated lattice-tool man page to clarify the interaction of 
 | |
| 	LM rescoring and lattice decoding.
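
The NaN fix listed above (an infinite lattice score paired with a scale factor of 0) follows from floating-point arithmetic: infinity times zero is NaN. The Python sketch below only illustrates that effect and the "ignore the score" remedy; the function name combine_scores is hypothetical and is not taken from the SRILM sources.

```python
import math

LOG_ZERO = float("-inf")  # log probability of an impossible event


def combine_scores(scores, scales):
    """Weighted sum of log scores that skips any term whose scale is 0."""
    total = 0.0
    for score, scale in zip(scores, scales):
        if scale == 0.0:
            continue  # a zero-scaled score is ignored, even if it is infinite
        total += scale * score
    return total


# The naive product is NaN, which would poison every later comparison.
naive = LOG_ZERO * 0.0
print(math.isnan(naive))

# Skipping the zero-scaled term leaves only the finite component.
print(combine_scores([LOG_ZERO, -2.5], [0.0, 1.0]))
```

An infinite score with a nonzero scale still propagates as -infinity, so only the zero-scale case is special.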
 | |
| 
 | |
| 	Portability:
 | |
| 
 | |
| 	* Added configuration for Solaris amd64 platform with 
 | |
| 	Sun C compiler (amd64-solaris_spro).
 | |
| 
 | |
| 	* Updated instructions for MSVC build (see doc/README.windows-msvc),
 | |
| 	based on input from Mike Frandsen.
 | |
| 	Merge MSVC .manifest files into binary before installation.
 | |
| 
 | |
| 1.5.3	28 July 2007
 | |
| 
 | |
| 	Functionality:
 | |
| 
 | |
| 	* New ngram-count -write-binary-lm option to output LM in binary format
 | |
| 	(avoids the need to dump ascii format first, and then convert to
 | |
|  	binary using ngram tool).
 | |
| 
 | |
| 	* New make-google-ngrams yahoo=1 option to read Yahoo ngram corpus
 | |
| 	(which needs to be sorted first, however).
 | |
| 
 | |
| 	* New make-big-lm -ngram-filter option to pipe input counts through
 | |
| 	an arbitrary filter program (e.g., for format conversion).
 | |
| 
 | |
| 	* The make-kn-discounts utility will now try to estimate missing
 | |
| 	counts-of-counts based on their global statistics, using an empirical
 | |
| 	law: 		log f(k) - log f(k+1) = C / k for some constant C.
 | |
| 	Note this functionality is not implemented in the C++ code for KN
 | |
| 	discounting.  Therefore, it is only available when building LMs with
 | |
| 	make-big-lm.
 | |
| 
 | |
| 	* New scripts tolower-ngram-counts and uniq-ngram-counts to help 
 | |
| 	manipulate counts files.
 | |
| 
 | |
| 	* New option ngram-count -write-vocab-index (for debugging).
 | |
| 
 | |
| 	* Vocab.h: Increased maxWordLength constant from 256 to 1024.
 | |
| 
 | |
| 	* Trie class can now initialize root node size with optional constructor
 | |
| 	argument (similar to other container classes).
 | |
| 
 | |
| 	* LHash and SArray classes have a new function to preallocate space
 | |
| 	following construction (but before any data is inserted).
 | |
| 
 | |
| 	* The platform "i686-p4" has been renamed "i686-icc" (Linux x86 with
 | |
| 	Intel compiler) for consistency.
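
The empirical law quoted in the counts-of-counts entry above, log f(k) - log f(k+1) = C / k, can be turned into an extrapolation rule: f(k+1) = f(k) * exp(-C / k). The Python sketch below is an illustration only. It assumes C is estimated by averaging k * (log f(k) - log f(k+1)) over the adjacent pairs that are available, which may differ from what the SRILM script does, and the function names are hypothetical.

```python
import math


def estimate_constant(counts_of_counts):
    """Average C = k * (log f(k) - log f(k+1)) over adjacent known pairs."""
    values = []
    for k in sorted(counts_of_counts):
        if k + 1 in counts_of_counts:
            f_k = counts_of_counts[k]
            f_next = counts_of_counts[k + 1]
            values.append(k * (math.log(f_k) - math.log(f_next)))
    if not values:
        raise ValueError("need at least one adjacent pair f(k), f(k+1)")
    return sum(values) / len(values)


def fill_missing(counts_of_counts, max_k):
    """Extrapolate f(k+1) = f(k) * exp(-C / k) for missing entries up to max_k."""
    c = estimate_constant(counts_of_counts)
    filled = dict(counts_of_counts)
    for k in range(1, max_k):
        if k in filled and k + 1 not in filled:
            filled[k + 1] = filled[k] * math.exp(-c / k)
    return filled


# Hypothetical counts-of-counts for k = 1..3; f(4) is missing.
known = {1: 1000.0, 2: 1000.0 * math.exp(-2.0), 3: 1000.0 * math.exp(-3.0)}
print(fill_missing(known, 4))
```

All known counts-of-counts must be positive, since the rule works in the log domain.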
 | |
| 
 | |
| 	Bugs:
 | |
| 
 | |
| 	* Fixed a buffer overrun problem triggered by nbest rescoring of
 | |
| 	empty hypotheses.
 | |
| 
 | |
| 	* Fixed problem in compute-sclite with extraction of speaker labels
 | |
| 	from ctm files.
 | |
| 
 | |
| 	* NBest class (affecting nbest-pron-score): strip Decipher-specific
 | |
| 	phone diacritic labels separated by underscores from pronunciation
 | |
| 	strings.
 | |
| 
 | |
| 	* Fixed memory leak in Trie::removeTrie().  This was causing a leak
 | |
| 	in NgramLM deallocation.
 | |
| 
 | |
| 	* Fixed a performance bug which caused the building of unigram
 | |
| 	hash tables to have quadratic time complexity (due to an unfortunate 
 | |
| 	interaction between hash table iterators and hash functions).
 | |
| 
 | |
|     	* Made make-big-lm detect missing -read option and print usage message.
 | |
| 	It also now handles the degenerate case of -kndiscount with -order 1.
 | |
| 
 | |
| 	* Workaround for icc compiler error: optimization disabled for some
 | |
| 	files when using MACHINE_TYPE=i686-m64-icc.
 | |
| 
 | |
| 1.5.4 	2 November 2007
 | |
| 
 | |
| 	Functionality:
 | |
| 
 | |
| 	* New option ngram-count -addsmooth for additive smoothing.
 | |
| 	A corresponding new discounting subclass "AddSmooth" is defined in 
 | |
| 	Discount.h.
 | |
| 
 | |
| 	* New option ngram -server-port to start a "probability server"
 | |
| 	(based on a contribution by Elad Dinur).
 | |
| 
 | |
| 	* WordLattice: print lattice name in warning messages.
 | |
| 
 | |
| 	* lattice-tool -keep-unk option to preserve labels of OOV words in
 | |
| 	LM rescoring (currently works only for HTK lattices).
 | |
| 
 | |
| 	* New option nbest-optimize -anti-refs and -anti-ref-weight to 
 | |
| 	decorrelate errors with another set of hypotheses.
 | |
| 
 | |
| 	* New support in nbest-optimize for BLEU optimization and Powell search
 | |
| 	(from Jing Zheng).
 | |
| 
 | |
| 	* New option ngram-class -save-maxclasses to start the saving of 
 | |
| 	intermediate results when a specified number of classes is reached
 | |
| 	(suggested by Shlomo Wavrow and Mats Svenson).
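
For readers unfamiliar with the additive smoothing mentioned under ngram-count -addsmooth above, the textbook form adds a constant delta to every count before normalizing. The Python sketch below shows that textbook formula only; it is not taken from the AddSmooth class in Discount.h, and the function name and parameters are hypothetical.

```python
def add_smooth_prob(count, context_total, vocab_size, delta=1.0):
    """Textbook additive smoothing: (c + delta) / (N + delta * V)."""
    return (count + delta) / (context_total + delta * vocab_size)


# Hypothetical counts of three words following one context.
counts = {"a": 3, "b": 1, "c": 0}
total = sum(counts.values())
probs = {w: add_smooth_prob(c, total, len(counts)) for w, c in counts.items()}
print(probs)
```

The unseen word "c" receives nonzero probability, and the distribution over the vocabulary still sums to one.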
 | |
| 
 | |
| 	Bugs:
 | |
| 
 | |
| 	* Fixed incorrect reference output for test "nbest-rover-acoustic".
 | |
| 
 | |
| 	* Fixed a possible problem with tests "ngram-class" and
 | |
| 	"ngram-count-lm-limit-vocab" in non-C locales.
 | |
| 
 | |
| 	* nbest-lattice: Avoid aligning reference words with -dump-errors or
 | |
| 	-wer, which would cause crash because no lattice is being generated
 | |
| 	internally.
 | |
| 
 | |
| 	* make-batch-counts, merge-batch-counts: be more portable by dynamically
 | |
| 	finding the right options to use with xargs.
 | |
| 
 | |
| 	* add-pauses-to-pfsg: Avoid using a regular expression construct that
 | |
| 	causes a gawk error in UTF-8 locales.  However, to ensure this works
 | |
| 	correctly a gawk version of 3.1.5 should be used. See note in
 | |
| 	doc/README.linux.  If the test "make-ngram-pfsg" fails a workaround is
 | |
| 	to set LANG=C or LANG=en_US and avoid UTF-8.
 | |
| 
 | |
| 	* Fixed an uninitialized member variable in the unary constructor for
 | |
| 	class File, which was causing garbage to be returned on the first
 | |
| 	getline().
 | |
| 
 | |
| 	* common/Makefile.machine.macos: Updated Tcl linking instructions
 | |
| 	(from Chuck Wooters).
 | |
| 	
 | |
| 	* Makefile: exit immediately if any of the subdirectories result in
 | |
| 	build errors.
 | |
| 
 | |
| 1.5.5	6 November 2007
 | |
| 
 | |
| 	Bug fixes:
 | |
| 
 | |
| 	* Fixed Makefile problem in binaries depending on libraries that was 
 | |
| 	preventing executables being generated on some platforms.
 | |
| 
 | |
| 	* Fixed a compilation problem with MSVC for nbest-optimize.
 | |
| 
 | |
| 	* Use MSVC _getpid() in ngram -generate random seed initialization.
 | |
| 
 | |
| 1.5.6	2 January 2008
 | |
| 
 | |
| 	Functionality:
 | |
| 
 | |
| 	* New ngram -use-server option to run the client side of a network LM
 | |
| 	server as implemented by ngram -server-port.  Optionally, probabilities
 | |
| 	may be cached in the client (option -cache-served-ngrams).
 | |
| 	Mixtures of one or more network and file-based LMs are also possible.
 | |
| 
 | |
| 	* Likewise, disambig, hidden-ngram, and lattice-tool understand the
 | |
| 	-use-server option.
 | |
| 
 | |
| 	* New LMClient class to implement the above (a stub LM subclass that
 | |
| 	queries a server for LM probabilities).
 | |
| 
 | |
| 	* ngram -server-port now behaves like a true server daemon: it handles
 | |
| 	multiple simultaneous or sequential clients, and never exits (unless
 | |
| 	killed).  The number of simultaneous clients may be limited with the
 | |
| 	-server-maxclients option.
 | |
| 
 | |
| 	* Support for 7-zip compressed files (suggested by Alexy Khrabrov).
 | |
| 
 | |
| 	* lattice-tool -split-multiwords will now print a warning message
 | |
| 	about multiwords that were not split because their LM probability was
 | |
| 	non-zero.
 | |
| 
 | |
| 	* LoglinearMix LM class supports n-way mixtures directly, giving more
 | |
| 	efficient implementation for n > 2 than recursive object construction
 | |
| 	in ngram (contributed by Tanel Alumae).
 | |
| 
 | |
| 	Bug fixes:
 | |
| 
 | |
| 	* MultiwordLM now implicitly adds all words to the vocabulary, so that
 | |
| 	previously unseen multiwords get split.  This has the side effect that
 | |
| 	OOVs will appear as zeroprob words.
 | |
| 
 | |
| 	Documentation:
 | |
| 
 | |
| 	* The doc/FAQ file has been expanded and reformatted as a man page.
 | |
| 	It can be viewed with "man srilm-faq" or online at
 | |
| 	http://www.speech.sri.com/projects/srilm/manpages/srilm-faq.html .
 | |
| 	The major content additions are questions about the build
 | |
| 	process, how to build a "Google N-gram LM", smoothing issues,
 | |
| 	and OOV-handling (the latter by Deniz Yuret).  Corrections and
 | |
| 	additions to this document are most welcome!
 | |
| 
 | |
| 	* A new manual page ngram-discount(7) gives a detailed overview of
 | |
| 	smoothing methods found in SRILM (contributed by Deniz Yuret).
 | |
| 
 | |
| 	* The conversion of man pages to html has been enhanced to better
 | |
| 	handle code samples and nested itemized lists.
 | |
| 
 | |
| 1.5.7	14 October 2008
 | |
| 
 | |
| 	Functionality:
 | |
| 
 | |
| 	* make-big-lm -text option allows building of LMs that only contain
 | |
| 	N-gram contexts that are needed for a given test set, thus saving
 | |
| 	space.
 | |
| 
 | |
| 	* ngram-count -intersect option allows reading of counts to be
 | |
| 	restricted to an N-gram subset.
 | |
| 
 | |
| 	* NgramStats added a Boolean switch "intersect" and a method
 | |
| 	setCount(), used for implementing the above.
 | |
| 
 | |
| 	* Allow changing the character used to compound multiwords, using the
 | |
| 	new option -multi-char with ngram, anti-ngram, nbest-lattice,
 | |
| 	nbest-optimize, nbest-pron-score, and several of the nbest-scripts.
 | |
| 
 | |
| 	* New options -no-sos and -no-eos for ngram-count and ngram tools,
 | |
| 	to control the insertion of <s> and </s> tokens around sentences.
 | |
| 
 | |
| 	* New lattice-tool -no-expansion option to decode a lattice with a
 | |
| 	new LM without first expanding the lattice (contributed by Jing Zheng).
 | |
| 
 | |
| 	* New CachedMem mix-in class to implement a caching memory allocator
 | |
| 	(contributed by Jing Zheng).
 | |
| 
 | |
| 	* Added lattice-tool -print-sent-tags option to preserve <s> and </s>
 | |
| 	tags in lattice output format, instead of mapping them to null nodes.
 | |
| 
 | |
| 	Documentation:
 | |
| 
 | |
| 	* Added redirecting http links to non-SRILM program documentation
 | |
| 	in manual pages.
 | |
| 	
 | |
| 	Portability:
 | |
| 
 | |
| 	* Removed SRI-specific paths etc. from common/Makefile.machine.* .
 | |
| 	Added a mechanism that allows site-specific customizations to be 
 | |
| 	recorded in common/Makefile.site.$MACHINE_TYPE to override definitions
 | |
| 	in common/Makefile.machine.$MACHINE_TYPE, without a need to change the
 | |
| 	latter.
 | |
| 
 | |
| 	Bug fixes:
 | |
| 
 | |
| 	* Always output the elements of binary count files and ngram LMs
 | |
| 	in index-sorted order (same as the _c program version).  This avoids
 | |
| 	poor performance when reading the data back in.
 | |
| 
 | |
| 	* Fixed LMClient.h so it compiles on win32 and msvc platforms (even
 | |
| 	though it still doesn't do anything, since Unix sockets are not
 | |
| 	supported).
 | |
| 
 | |
| 	* Process ngram-count -writeN options after applying count smoothing,
 | |
| 	so that the effect of any count modifications (e.g., by KN) is seen,
 | |
| 	and consistent with the -write option.
 | |
| 
 | |
| 	* Fixed the timestamps on initial and final nodes of lattice-tool
 | |
| 	-operation or (bug found by gaojie@hccl.ioa.ac.cn).
 | |
| 
 | |
| 	* NgramLM: Handle cases where interpolated discounting leaves no
 | |
| 	backoff probability mass.
 | |
| 
 | |
| 	* AdaptiveMarginals: Now handles words that are added after LM was
 | |
| 	created. This can happen in N-best rescoring and would previously
 | |
| 	cause an assertion failure.
 | |
| 
 | |
| 	* Fixed bugs in IntervalHeap memory allocation, which could
 | |
| 	cause problems in N-best generation from lattices  (from Jing Zheng).
 | |
| 
 | |
| 	* Set LC_NUMERIC=C in make-big-lm to avoid problems with non-C 
 | |
| 	locales for gawk scripts that compute discounting parameters.
 | |
| 
 | |
| 1.5.8	10 May 2009
 | |
| 
 | |
| 	Functionality:
 | |
| 
 | |
| 	* merge-batch-counts -float-counts option for merging of fractional
 | |
| 	counts.
 | |
| 
 | |
| 	* compare-sclite now includes statistical significance computation
 | |
| 	based on a matched-pair Sign test.
 | |
| 
 | |
| 	* Added a Perl tool to compute the cumulative binomial distribution,
 | |
| 	contributed by Brett Kessler and David Gelbart.
 | |
| 
 | |
| 	* Don't output LM server banner message for ngram -use-server -debug 0.
 | |
| 
 | |
| 	* The LM::generateSentence() function now takes an optional argument to 
 | |
| 	specify sentence prefix that is to be used to condition subsequent
 | |
| 	word generation (suggested by Alexy Khrabrov).  The default is to
 | |
| 	condition on <s> as before, or an empty context if no start-of-sentence
 | |
| 	tag is defined.
 | |
| 
 | |
| 	* A new option ngram -gen-prefixes to read conditioning prefixes
 | |
| 	from a file, and generate random sentences based on them.
 | |
| 
 | |
| 	* New options in nbest-optimize that modify -print-hyps output so that
 | |
| 	only unique hypotheses are included (-print-unique-hyps), and to print
 | |
| 	the original ranks of hypotheses (-print-old-ranks) (from Jing Zheng).
 | |
| 
 | |
| 	* The -version option reports whether support for compressed files
 | |
| 	is available.
 | |
| 
 | |
| 	* Added merge-batch-counts -l option to control how many files to merge
 | |
| 	in each iteration.
 | |
| 
 | |
| 	Bug fixes:
 | |
| 
 | |
| 	* ngram-count, NgramLM: disable the Doug Paul smoothing hack (add one
 | |
| 	to denominator when smoothing results in 0 backoff mass) in contexts
 | |
| 	where the entire vocabulary has been observed.
 | |
| 
 | |
| 	* nbest-optimize fixes to the -minimum-bleu-reference functionality
 | |
|  	(from Jing Zheng).
 | |
| 
 | |
| 	* Fixed nbest-optimize bug that was causing incorrect log output with
 | |
| 	gcc 4.x.
 | |
| 
 | |
| 	* Output vocabulary index map in binary ngram count and LM format
 | |
| 	in numerical index order.  This avoids a performance bug whereby
 | |
| 	reading the data structures back into _c binary version could take
 | |
| 	a long time due to inefficient insertion order.
 | |
| 
 | |
| 	* Fix ngram -counts with -use-server (from Ergun Bicici).
 | |
| 
 | |
| 	* Fixed memory allocation bug in FLM tag vocabulary handling that could
 | |
| 	lead to crash when interpolating several FLMs.
 | |
| 	
 | |
| 	* Rewrote make-batch-counts scripts to
 | |
| 	  - avoid problems with limits on command line length
 | |
| 	  - support systems that don't have compressed file I/O.
 | |
| 	
 | |
| 	* Modified merge-batch-counts script to
 | |
| 	  - ensure that unmerged files are always merged in the next iteration,
 | |
| 	  to avoid file size imbalance (suggested by Alex Marin)
 | |
| 	  - support systems that don't have compressed file I/O.
 | |
| 
 | |
| 	* Fixed a portability issue with Intel icc version 7.0.
 | |
| 
 | |
| 	* compute-sclite fixed to invoke csrfilt.sh script with -t option.
 | |
| 
 | |
| 1.5.9	24 August 2009
 | |
| 
 | |
| 	Functionality:
 | |
| 	
 | |
| 	* Added ngram-count -text-has-weights option to scale counts on a 
 | |
| 	per-sentence basis.
 | |
| 
 | |
| 	* LMStats::countString() and NgramStats::countSentence() methods 
 | |
| 	generalized to take optional weight string argument (to support the
 | |
| 	above change).
 | |
| 
 | |
|  	* Added compile-time option to generate position-independent code 
 | |
| 	(make MAKE_PIC=yes, see INSTALL file).
 | |
| 
 | |
| 	* Added support for xz-compressed files (.xz files offer better
 | |
| 	compression than .gz at the expense of time and memory).
 | |
| 	The xz tool has to be installed separately (http://tukaani.org/xz).
 | |
| 
 | |
| 	Bug fixes:
 | |
| 
 | |
| 	* wlat-to-pfsg generates NULL output labels for initial/final nodes
 | |
| 	with sentence start/end tags (because PFSGs encode those implicitly).
 | |
| 
 | |
| 	* TaggedVocab: check and report if number of tags/words exceeds max.
 | |
| 	Make number of bits allocated for tags/words proportional to 
 | |
| 	word size. Parse word/tag strings such that last (not the first)
 | |
| 	slash (/) character is treated as the delimiter.
 | |
| 
 | |
| 	* Documented the lattice-tool -ngrams-time-tolerance option that had
 | |
| 	been previously implemented but omitted from the man page.
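
The TaggedVocab change above (treating the last slash, not the first, as the word/tag delimiter) matters for words that themselves contain slashes. The Python sketch below illustrates the parsing rule only; split_word_tag is a hypothetical name, not an SRILM function.

```python
def split_word_tag(token):
    """Split 'word/tag' at the LAST slash; return (token, None) if untagged."""
    word, sep, tag = token.rpartition("/")
    if not sep:
        return token, None
    return word, tag


# A word containing a slash keeps it; only the final slash separates the tag.
print(split_word_tag("1/2/NUM"))
print(split_word_tag("hello"))
```

Splitting at the first slash would instead yield the word "1" and the tag "2/NUM".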
 | |
| 
 | |
| 1.5.10	7 Jan 2010
 | |
| 	
 | |
| 	Functionality:
 | |
| 
 | |
| 	* New option ngram -float-counts to allow the -counts option to
 | |
| 	process fractional counts.
 | |
| 
 | |
| 	* The LM::pplCountsFile() and LM::countsProb() have been templatized
 | |
| 	(as a function of count type), and the TextStats class now uses double
 | |
| 	float counts, all in support of the above change.
 | |
| 
 | |
| 	* New option lattice-tool -word-posteriors-for-sentences for computing
 | |
| 	word posteriors based on confusion networks (contributed by Jing Zheng).
 | |
| 
 | |
| 	* lattice-tool now performs confusion network decoding and ngram 
 | |
| 	computation AFTER rescoring or expansion with LMs.  Therefore the two
 | |
| 	operations can be combined in a single run where previously two
 | |
| 	invocations were necessary.
 | |
| 
 | |
| 	* Added fsm-to-pfsg map_epsilon= option, to translate FSM <eps> symbols
 | |
| 	to another label.
 | |
| 
 | |
| 	* New script filter-event-counts to preprocess a count file for use 
 | |
| 	with ngram -counts .
 | |
| 
 | |
| 	* lattice-tool continues processing when one of the lattices specified
 | |
| 	with -in-lattice-list cannot be opened.
 | |
| 
 | |
| 	* Regression tests have been moved to module subdirectories
 | |
| 	(lm/test, flm/test, lattice/test) and can now be run from the 
 | |
| 	top-level with "make test".  Decompression of data files for platforms
 | |
| 	that don't support compressed file I/O is now automatic.
 | |
| 
 | |
| 	Documentation:
 | |
| 
 | |
| 	* Added new FAQ items covering handling of OOVs and zeroprob words,
 | |
| 	based on input from Nitin Madnani.
 | |
| 
 | |
| 	* Correction to the man page description of the ngram -count-order
 | |
| 	option:  It limits the maximal order of processed ngrams.
 | |
| 
 | |
| 	* Corrected and updated ordered list of processing steps in
 | |
| 	lattice-tool man page.
 | |
| 
 | |
| 	Bug fixes:
 | |
| 
 | |
| 	* Use double precision to record log probs in TextStats object.
 | |
| 
 | |
| 	* Workaround for a deficiency in Intel's 7.00 C++ compiler.
 | |
| 
 | |
| 	* lattice-tool was not handling PFSG lattices in (1best or N-best) 
 | |
| 	decoding with a LM.
 | |
| 
 | |
| 	* lattice-tool will exit with a non-zero status if any of the lattice
 | |
| 	operations fail.
 | |
| 
 | |
| 	* Fixed some format string/argument mismatches that could bite on
 | |
| 	64-bit platforms.
 | |
| 
 | |
| 	* Updated usage of sort with key specification to conform to latest
 | |
| 	POSIX standard.  The old syntax was no longer working with recent
 | |
| 	GNU sort versions.
 | |
| 
 | |
| 1.5.11	16 June 2010
 | |
| 
 | |
| 	Functionality:
 | |
| 
 | |
| 	* New program "maxalloc" to find the maximum amount of memory that
 | |
| 	can be allocated by a user process in the current environment.
 | |
| 	May be useful to debug out-of-memory conditions.
 | |
| 
 | |
| 	Bug fixes:
 | |
| 
 | |
| 	* Avoid deleting low-posterior null tokens when aligning lattices into
 | |
| 	word meshes.
 | |
| 
 | |
| 	* Map explicit start/end-of-sentence tags in HTK lattices to null, 
 | |
| 	since they are already implicitly attached to the start/end nodes
 | |
| 	of the lattice (LM scoring gives anomalous results on repeated tags).
 | |
| 
 | |
| 	* option.[ch]: fixed declaration issues to avoid compiler warnings.
 | |
| 
 | |
| 	* Moved man page for the option library functions to misc/doc.
 | |
| 
 | |
| 	Bug fixes:
 | |
| 
 | |
| 	* Fixes to compile cleanly with gcc -Wall -Wno-unused-variable
 | |
| 	-Wno-uninitialized.
 | |
| 	* Fixed a problem with gcc-4.4 compiles.
 | |
| 	* Fixed a problem with macro definition of fseeko() ftello().
 | |
| 	* Fixed a problem with the lm/ngram-count-wb-subset test, which could
 | |
| 	fail after the test data is uncompressed.
 | |
| 	* Use gzip -d to read gzipped files, avoids shell wrapper overhead.
 | |
| 
 | |
| 1.5.12  20 Jan 2011
 | |
| 
 | |
| 	Functionality:
 | |
| 
 | |
| 	* Enable lattice-tool -old-decoding if -nbest-duplicates is specified
 | |
| 	(and warn about it).  
 | |
| 	* Support make-big-lm -wbdiscount option.
 | |
| 	* New option ngram -prune-history-lm, for specifying a separate LM that
 | |
| 	computes the history marginal probabilities needed for N-gram pruning
 | |
| 	purposes.  Inspired by C. Chelba et al., "Study on Interaction Between
 | |
| 	Entropy Pruning and Kneser-Ney Smoothing", Proc. Interspeech-2010.
 | |
| 	* Added optional limitVocab argument to VocabMultiMap::read() function.
 | |
| 	This is now used by lattice-tool -limit-vocab to avoid reading parts of
 | |
| 	the dictionary that are not used in the input.
 | |
| 	* Added an option -zeroprob-word to ngram and lattice-tool. It
 | |
| 	specifies a word that should be used as a replacement if the current
 | |
| 	word has probability zero.  This is different from -map-unk which only
 | |
| 	applies to OOV words and actually replaces the word label in the output
 | |
| 	lattice, if any.
 | |
| 	* Added new wrapper LM class NonzeroLM, to implement the above.
 | |
| 
 | |
| 	Portability:
 | |
| 
 | |
| 	* New MACHINE_TYPE values for Android-ARM platform: android-armeabi and
 | |
| 	android-armeabi-v7a (from Mike Frandsen).
 | |
| 	* Deleted the htk directory from distribution; it was obsolete and not
 | |
| 	documented.
 | |
| 
 | |
| 	Bug fixes:
 | |
| 
 | |
| 	* Prob.h: guard against under/overflow in intlog and bytelog
 | |
| 	conversions.
 | |
| 	* Replaced gunzip with gzip -d in all scripts (for efficiency).
 | |
| 	* Better option checking in make-big-lm, disallowing mixing of
 | |
| 	discounting methods and use of discounting flags that are not supported.
 | |
| 	* Undefine max() macro in Trellis.h to avoid conflict with some system
 | |
| 	header files.
 | |
| 	* Better support for recent MSVC versions in
 | |
| 	common/Makefile.machine.msvc (from Mike Frandsen).
 | |
| 	* add-pauses-to-pfsg: prevent existing pause nodes from being processed.
 | |
| 
 | |
| 1.6.0  8 December 2011
 | |
| 
 | |
| 	Functionality:
 | |
| 
 | |
| 	* Added lattice-tool -loglinear-mix option.
 | |
| 	* Add platform-independent strtok_r() function, and replaced all
 | |
| 	instances of strtok().
 | |
| 	  Eventual goal is thread safety and reentrancy.
 | |
| 	* Modified File object to allow I/O to/from strings as well as files.
 | |
| 	* Modified code for reading and writing HTK lattices and NBest lists to
 | |
| 	enable I/O to/from strings as well as files, for in-memory processing.
 | |
| 	* Added special-purpose malloc/free implementation for SArray and LHash
 | |
| 	data structures, to reduce overhead for small allocation chunks.  Also
 | |
| 	added some allocation statistics reporting (enabled by ngram -memuse
 | |
| 	-debug 1).
 | |
| 	* Added the metadb config file lookup tool.
 | |
| 	* Cumulative binomial script (cumbin) command accepts optional 3rd
 | |
| 	argument to set p parameter.
 | |
| 
 | |
| 	Bug fixes:
 | |
| 	
 | |
| 	* Correctly handle lattice-tool -use-server when generating nbest lists
 | |
| 	(server-based LM was previously ignored).
 | |
| 	* lattice-tool -split-multiwords no longer splits words appearing in 
 | |
| 	-ignore-vocab.
 | |
| 	* lattice-tool allowed to operate on HTK lattices containing unrecognized
 | |
| 	header fields (but warn about them).
 | |
| 	* Updated reference output for many build platforms to avoid spurious
 | |
| 	test failures.
 | |
| 	* Avoid abnormal backoff weights when lower-order probabilities sum to
 | |
| 	almost one.
 | |
| 	* Avoid test failures for merge-batch-counts and make-ngram-pfsg due to
 | |
| 	locale differences.
 | |
| 	* Fix maxalloc for 64bit systems where "long" is still 32 bits.
 | |
| 
 | |
| 	Building:
 | |
| 
 | |
| 	* Added Microsoft Visual Studio 2005 projects, see
 | |
| 	doc/README.windows-msvc-visual-studio for more information.
 | |
| 	* Added new Makefile targets superclean and pristine to return
 | |
| 	SRILM to pre-build state.
 | |
| 	* Add Makefiles for MACHINE_TYPE macosx-m32 and macosx-m64 to
 | |
| 	allow explicit 32- or 64-bit compilation on MacOS X 10.6.  Updated
 | |
| 	GAWK location to allow tests to succeed.
 | |
| 	* Replaced various C-shell helper scripts in sbin/ with Bourne-shell 
 | |
| 	versions, for greater portability.
 | |
| 	* New MACHINE_TYPE=msvc64 for 64bit builds with Visual Studio.
 | |
| 
 | |
| 	Documentation:
 | |
| 
 | |
| 	* Added doc/asru2011-srilm.pdf, a paper describing SRILM updates since
 | |
| 	2002.  Old ICSLP paper renamed to doc/icslp2002-srilm.pdf .
 | |
| 
 | |
| 1.7.0	23 December 2012
 | |
| 
 | |
| 	Functionality:
 | |
| 
 | |
| 	* ngram -codebook option for reading of Ngram LMs with quantized parameters
 | |
| 	(contributed by Microsoft).
 | |
| 	* ngram -msweb-lm option for obtaining LM probabilities from the Microsoft
 | |
| 	Web N-gram service (web-ngram.research.microsoft.com). You need to obtain
 | |
| 	a user ID to use this service, see man ngram for details (contributed by
 | |
| 	Microsoft).
 | |
| 	* Added support for dictionary-induced word distance metrics to
 | |
| 	nbest-optimize (-dictionary option).
 | |
| 	* Added support for matrix-defined word distance metrics to
 | |
| 	nbest-optimize (-distances option).
 | |
| 	* ngram -debug 4 -ppl outputs ranking statistics (number of times correct 
 | |
| 	word was in top 1, 5, 10), as well as quadratic and absolute loss averages
 | |
| 	(based on code from Omid Madani).
 | |
| 	* nbest-optimize accepts n-best list in SRInterp format and generates
 | |
| 	SRInterp format rover-control file (weights file), when -srinterp-format
 | |
| 	is specified. 
 | |
| 	* nbest-optimize accepts SRInterp counts file that contains BLEU and TER
 | |
| 	counts info.
 | |
| 	* lattice-tool -read-mesh will try to preserve acoustic information
 | |
| 	(times, scores, pronunciations) if they are encoded in the input confusion
 | |
| 	network.
 | |
| 	* Support reading of text files in UTF-8 and UTF-16 encodings.  All string
 | |
| 	data is internally represented, and output, as ASCII/UTF-8 (contributed
 | |
| 	by Microsoft).
 | |
| 	This feature uses the iconv library.  Support for this feature can be
 | |
| 	disabled by compiling with "NO_ICONV=anything" on the make command line.
 | |
| 
 | |
| 	Portability:
 | |
| 
 | |
| 	* Ported LM client/server code to Winsock API (native socket library in
 | |
| 	Windows), enabling this functionality for mingw and MSVC platforms
 | |
| 	(contributed by Microsoft).
 | |
| 	* Let machine-type script return 64bit platform names for Linux and Solaris
 | |
| 	x86 when appropriate.  This implies that 64bit binaries are built by
 | |
| 	default on machines that support them.
 | |
| 	* Array.h tweak for clang compiler (from kutlak.roman@gmail.com).
 | |
| 	* Work around a namespace problem in C++11 (from kutlak.roman@gmail.com).
 | |
| 	* Use size_t for hash codes to ensure word width matches pointer type.
 | |
| 	* Fixes for mingw32 build, using Windows APIs for sockets and UTF
 | |
| 	conversion (contributed by Microsoft).
 | |
| 	* Support for 64bit mingw build (MACHINE_TYPE=win64).
 | |
| 	* Updates for MacOSX (MACHINE_TYPE=macosx, thanks to Chuck Wooters).
 | |
| 	* Deal with nonportability of isfinite() and isnan().
 | |
| 	* Changes for thread-safety (by Kyle McIntyre). See doc/README-THREADS
 | |
| 	for details.
 | |
| 	- Modified the remove() methods in various container classes to return 
 | |
| 	Boolean instead of a pointer to the removed element.  The removed element
 | |
| 	can be gotten with an optional reference argument. This eliminates the 
 | |
| 	need for a global static variable.
 | |
| 	- Use STL sort() instead of qsort() in LHash and SArray sorted iterations.
 | |
| 	- Replaced all static variables with thread-local storage via the TLSWrapper
 | |
| 	class, requiring the pthread library. This is available on most platforms, 
 | |
| 	but can be disabled at compile-time with -DNO_TLS.
 | |
| 
 | |
| 	Bug fixes:
 | |
| 
 | |
| 	* NgramLM backoff computation fixed to avoid spurious insertion of nonzero	
 | |
| 	unigram probabilities and non-unity backoff weights (resulting from
 | |
| 	numerator/denominator values below Prob_Epsilon).
 | |
| 	* lattice-tool does a better job inferring the lattice basename from the 
 | |
| 	UTTERANCE string embedded in HTK lattices.
 | |
| 	* Trellis class: use a secondary sorting criterion to make N-best output
 | |
| 	deterministic.
 | |
| 	* WordMesh class: use posterior word probability to decide which acoustic
 | |
| 	information to keep when merging hyps, instead of duration-normalized
 | |
| 	acoustic scores as before.  This leads to fewer words with out-of-order
 | |
| 	timestamps when extracting one-best from confusion networks.
 | |
| 	* fix-ctm script: Check for out-of-order word timestamps and adjust them
 | |
| 	minimally as needed to produce a monotonic sequence, as required for
 | |
| 	CTM sorting.
 | |
| 	* Fixed bug in NgramCountLM estimation procedure reported by ariya@jhu.edu.
 | |
| 	* Allow ngram -hidden-vocab to read hidden event properties described in
 | |
| 	man page.
 | |
| 	* Fixed bug in ngram -hidden-vocab -write-lm output.
 | |
| 	* Avoid crash when ngram -hidden-not -ppl is used with debug level 2.
 | |
| 	* Fixed (very rare) bug by which ngram -prune might remove all ngrams
 | |
| 	sharing a common context.
 | |
| 	* Improved ngram -prune-lowprobs by also removing backoff weights that
 | |
| 	have become useless (suggested by Arlo Faria).
 | |
| 	* Check for successful search for HTK lattice start/end nodes, if not
 | |
| 	explicitly specified (reported by nshmyrev@yandex.ru).
 | |
| 	* Handle infinity scores in lattice rescoring, and catch NaN scores when
 | |
| 	reading HTK lattices.
 | |
| 	* make-kn-discounts checks for negative discount values and reports
 | |
| 	error if appropriate.
 | |
| 	* nbest-optimize accepts combined BLEU and error rate objective via switch
 | |
| 	-error-bleu-ratio R (R specifies the error rate weight).
 | |
| 	* lattice-tool -timeout option now uses sigsetjmp/siglongjmp to handle
 | |
| 	timeout alarms.  This is necessary in Linux-compatible (including cygwin)
 | |
| 	systems to handle alarms repeatedly.
 | |
| 	* Fixed a bug reading NBestList2.0 format without phone information (led
 | |
| 	to malformed confusion network output).
 | |
| 	* Fixed a bug in Ngram::contextID() that was causing incorrect expansion
 | |
| 	of lattices with pruned backoff models.
 | |
| 	* Fixed a bug in the lattice-tool -keep-unk implementation that was
 | |
| 	sometimes allowing an OOV word label to be output as <unk>.
 | |
| 	* Removed some pseudo-randomness in ngram-class so that results are more
 | |
| 	invariant to OPTION setting and platform properties.
 | |
| 	* Avoid differences due to machine arithmetic in word mesh alignment,
 | |
| 	making confusion network building and posterior decoding more stable 
 | |
| 	across platforms.
 | |
| 	* Exclude metatags when writing out the vocabulary of binary Ngram LMs.
 | |
| 	* Fixed some missing dependencies in Visual Studio solution file.
 | |
| 
 | |
| 1.7.1	4 June 2014
 | |
| 
 | |
| 	* Updated INSTALL, Copyright.  Added ACKNOWLEDGEMENTS.
 | |
| 
 | |
| 	Functionality:
 | |
| 
 | |
| 	* Integrated the maximum entropy extension by Tanel Alumae, described
 | |
| 	at http://www.phon.ioc.ee/~tanela/srilm-me/ .
 | |
| 	Please cite Tanel's paper (copied here in doc/is2010-maxent.pdf) if you
 | |
| 	use this functionality in your research.
 | |
| 	* Enable LM server to process multiple commands in a single message
 | |
| 	(separated by newlines).  This capability was never documented; it
 | |
| 	existed in the first implementation, which used read/write system calls,
 | |
| 	but was lost when we switched to recv/send calls.
 | |
| 	* Generalized the BayesMix LM class to allow an arbitrary number of
 | |
| 	mixture components, similar to LoglinearMix.
 | |
| 	* Added the ngram -context-priors option to read context-dependent
 | |
| 	mixture weight priors from a file.
 | |
| 	* Added the ngram -read-mix-lms option to read the list of interpolated
 | |
| 	LMs, weights and options from a file, specified by the -lm option.
 | |
| 	* Use zlib for I/O from/to gzipped files. Benefits are: (a) works with
 | |
| 	native Windows binaries, (b) avoids subprocess, (c) allows reading
 | |
| 	(though still not writing) of gzipped binary LM and count files.
 | |
| 	* ngram-count -gtNmin options accept floating point values for more
 | |
| 	flexibility with LM estimation from fractional counts.
 | |
| 	* Added lattice-tool -set-lattice-names option to preserve input 
 | |
| 	filenames inside lattices.
 | |
| 	* New script replace-unk-words, for replacing OOV words relative to
 | |
| 	a vocabulary with <unk> tag.
 | |
| 	* Added new lattice-tool options -hyp-list -hyp-file -hyp2-list
 | |
| 	-hyp2-file -add-hyps to add ASR hypotheses into word mesh (confusion
 | |
| 	network). The added options are similar to -ref-list -ref-file -add-refs,
 | |
| 	except that the added hypothesized words will not be indicated as 
 | |
| 	reference words in the word mesh.
 | |
| 	* Added a function in WordMesh to compute slot-to-slot alignment
 | |
| 	between two confusion networks.
 | |
| 	* Added ngram-class option to limit number of words per class (from
 | |
| 	seppo.enarvi@aalto.fi).
 | |
| 
 | |
| 	Portability:
 | |
| 
 | |
| 	* Added support for 64bit cygwin builds (MACHINE_TYPE=cygwin64).
 | |
| 
 | |
| 	Bug fixes:
 | |
| 
 | |
| 	* ngram -rescore-ngram was not setting the handling of special word
 | |
| 	tokens (<s>, </s>) if the rescored LM was being evaluated in the same
 | |
| 	run.
 | |
| 	* ngram-count -skip needs to read counts one order higher than specified
 | |
| 	by -order .
 | |
| 	* SkipNgram will now try to reestimate the discounting parameters from 
 | |
| 	expected counts on each EM iteration (but fall back on initial parameters
 | |
| 	if that fails, e.g., for discounting methods that cannot handle float
 | |
| 	counts).
 | |
| 	* SubVocab instances' handling of metatags and nonevent words is now
 | |
| 	tied to the base Vocab instance.
 | |
| 	* Avoid anomalies in random word generation due to nonzero probabilities
 | |
| 	for nonwords.
 | |
| 	* Cleaned-up select-vocab script from Anand Venkataraman.  Now works
 | |
| 	with perl 5.12 and gives consistent results on different platforms.
 | |
| 	Added a test case.
 | |
| 	* Fixed removeTrie() bug that was leading to memory leak in Ngram 
 | |
| 	destructor.
 | |
| 	* Fixed bug in LHash iterator that led to potential double enumeration
 | |
| 	of items after deletions, and could affect Ngram pruning results.
 | |
| 	* Allow number of ngrams in ARPA LM to exceed 2^31. (Vocabulary size
 | |
| 	is still limited to 2^32.)
 | |
| 	* Initialize key and data objects in SArray and LHash containers after 
 | |
| 	allocation.
 | |
| 	* Pass Trellis state parameters by reference to avoid copying of
 | |
| 	potentially complex objects.
 | |
| 	* Fixed memory access error in Ngram::clear() for order-1 models.
 | |
| 	* Fixed a problem handling null string states in Trellis.
 | |
| 	* Fix to preserve double precision in NBest acoustic and LM scores.
 | |
| 	* Fixed an error concerning the use of -gtNmin options in the srilm-faq(7)
 | |
| 	man page pointed out by dugast@systran.fr.
 | |
| 	* If a lattice-tool input lattice is a word mesh, avoid calling
 | |
| 	alignLattice() since the input is already a word mesh.
 | |
| 	* Fixes to reading/writing of quantization codebook files.
 | |
| 	* Fixed header comment and test program for Map2::remove().
 | |
| 
 | |
| 1.7.2	9 November 2016
 | |
| 
 | |
| 	Functionality:
 | |
| 
 | |
| 	* Added interfaces to Lattice and WordMesh that allow external programs
 | |
| 	to map sausage nodes to their original lattice nodes.
 | |
| 	* New VocabDistance subclass StemDistance, comparing words only based on
 | |
| 	their stems.
 | |
| 	* New lattice-tool option -stem-dist triggers StemDistance use in
 | |
| 	confusion network alignments, including -add-hyps and -add-refs processing.
 | |
| 	* Add optional support for keyword spotting (in Lattice.h and
 | |
| 	LatticeIndex.cc) when writing a 1-gram index.
 | |
| 	* Added new File field NBestOptions::nbestRttm2; if it exists, write
 | |
| 	(an approximation to) the NBestList2.0 format output.
 | |
| 	* Added simple Trellis pruning based on relative thresholding of forward
 | |
| 	probabilities (Trellis::prune()).
 | |
| 	* make-big-lm now understands the -ukndiscount option. The make-kn-discounts
 | |
| 	helper script has an option to compute unmodified KN discounts.
 | |
| 	* The -version option now reports the compiler version used.
 | |
| 	* Added ngram-count -write-text option to test conversion of UTF-16 files
 | |
| 	to ASCII/UTF-8.
 | |
| 	* Added ngram -text-has-weights option to allow weighting sentences in ppl
 | |
| 	computation.
 | |
| 	* Added scripts nbest-words and compute-sclite-nbest for conveniently
 | |
| 	computing nbest-optimize -errors information using sclite.
 | |
| 	* Added the nbest-optimize -xval-files option to support cross-validation.
 | |
| 	* Added script search-rover-combo for searching for best combination among
 | |
| 	a list of systems.
 | |
| 	* Added confidence value fields to NBestWordInfo class.
 | |
| 	* Added check to compute-best-mix to warn about word label mismatches between
 | |
| 	input files.
 | |
| 
 | |
| 	Portability:
 | |
| 
 | |
| 	* Honor TMPDIR environment variable in various scripts.
 | |
| 	* Miscellaneous MacOSX fixes.
 | |
| 	* Include BSD rand48 functions so that random sentence generation gives same
 | |
| 	result on all platforms.
 | |
| 
 | |
| 	Bug fixes:
 | |
| 
 | |
| 	* Avoid leaky backoff by mapping very small probability sums to 0 in BOW
 | |
| 	computation.  Otherwise unseen ngrams may end up with nonzero probabilities
 | |
| 	in unsmoothed LMs.
 | |
| 	* Fixed compare-ppls, compute-best-mix, compute-best-sentence-mix, and
 | |
| 	ppl-from-log to recognize the MSVC representation of -infinity.
 | |
| 	* Fixed a bug in the handling of zero prefix probabilities in ClassNgram,
 | |
| 	HiddenNgram and HMMofNgrams.
 | |
| 	* Fixed a memory allocation bug that caused the ngram-count-maxent test
 | |
| 	to crash.
 | |
| 	* Fixes to lattice-tool rttm nbest output.
 | |
| 	* Fix for possible endless loop in lattice-tool -posterior-prune due to 
 | |
| 	limited float precision (from Seppo Enarvi).
 | |
| 	* Fixed a problem with declarations of Map_noKeyP() that take reference
 | |
| 	arguments and were missing "const"; was causing crash in segment tool.
 | |
| 	* Workaround for what looks like an optimizer bug in gcc >= 4.9 that can
 | |
| 	cause ngram -prune to core dump.
 | |
| 	* Output TextStats quantities (sentence/word counts, log probs, perplexities),
 | |
| 	model parameters, nbest and lattices scores, and other quantities with full
 | |
| 	precision so as to avoid loss of information.
 | |
| 	* nbest-optimize -1best now outputs a rover-control file that simulates
 | |
| 	Viterbi decoding (by using a small posterior scale).
 | |
| 	* nbest-optimize -errors now tolerates varying number of reference words
 | |
| 	for the same sentence.  This can arise from sclite references with alternate
 | |
| 	word strings.
 | |
| 	* Fixed a stupid bug in uniform-classes.gawk script.
 | |
| 	* Allow combine-rover-controls to merge control files with the same systems 
 | |
| 	in them, adding their weights.
 | |
| 	* Updated zlib to version 1.2.8.  This fixes a bug whereby gzipped output files
 | |
| 	could end up with zero size (instead of a legal gzipped file that results in a 
 | |
| 	zero-length file when decompressed).
 | |
| 
 | |
| 1.7.3	9 September 2019
 | |
| 
 | |
| 	Functionality:
 | |
| 
 | |
| 	* Added nbest-oov-counts script to generate OOV counts for nbest hypotheses.
 | |
| 	* Added a simple mechanism for weight tying in nbest-rover control files.  A 
 | |
| 	system weight of = indicates that it should be tied to the previously listed
 | |
| 	system.  This is useful for reducing the number of free parameters when 
 | |
| 	searching for good system combinations (search-rover-combo).
 | |
| 	* Add Map_noKey() and Map_noKeyP() for unsigned long long type, to enable use
 | |
| 	with size_t on Windows MSVC.
 | |
| 	* Output from -version now includes compile-time options.
 | |
| 	* Added option ngram -minbackoff to fix up models that have unnormalized 
 | |
| 	probabilities or that are not smoothed.
 | |
| 	* Added option ngram -unk-probs to override unknown word probabilities.
 | |
| 	* Added nbest-optimize-args-from-rover-control script, convenient for 
 | |
| 	extracting initialization parameters for nbest-optimize from existing
 | |
| 	nbest-rover control file.
 | |
| 	* Added ngram-count -text-has-weights-last option to allow text input with 
 | |
| 	count values at ends of lines.
 | |
| 	* Added nbest-rover -missing-nbest option to treat missing nbest lists as if
 | |
| 	an empty hypothesis (no words) had been output, rather than simply skipping
 | |
| 	that nbest list.
 | |
| 	* Added nbest-lattice -time-penalty option, implementing a soft constraint
 | |
| 	on time stamps (when present) during confusion network building and alignment.
 | |
| 	* Added nbest-lattice -average-times option, to average word times instead
 | |
| 	of picking the timing of the highest posterior hypothesis.
 | |
| 	* Added nbest-lattice -suppress-vocab option to disallow certain words in 
 | |
| 	posterior decoding.
 | |
| 	* New script concat-sausages for chaining word confusion networks together.
 | |
| 	* Added nbest-lattice -dump-lattice-alignments option to output mappings
 | |
| 	between sausage positions and alignment costs.
 | |
| 	* Updated Android build for 64-bit development for armv8 using NDK r20 and clang.  
 | |
| 	This almost certainly breaks the 32-bit build for armv7.  The last known good 32-bit
 | |
| 	build is in common/Makefile.core.android.r11c, last built using NDK r11c.  To use this,
 | |
| 	copy Makefile.core.android.r11c to Makefile.core.android.  See doc/README.android.
 | |
| 
 | |
| 	Bug fixes:
 | |
| 
 | |
| 	* Added a new tool nbest-rover-helper that combines the functions of the
 | |
| 	combine-acoustic-scores and nbest-posteriors scripts, doing these computations
 | |
| 	in double precision and faster. nbest-rover now uses this tool (except when
 | |
| 	certain options like -nbest-backtrace are used).
 | |
| 	* nbest-rover strips DOS end-of-line CR characters from the control file, so
 | |
| 	they no longer mess up the parsing of the file.
 | |
| 	* Rationalize the way ties are broken when decoding word confusion networks.
 | |
| 	The word with the lowest internal index is now preferred (and the *DELETE* token
 | |
| 	always comes before all other words), unless the new nbest-lattice option
 | |
| 	-random-tie-break is given.  The output order of alternative word hypotheses
 | |
| 	to sausage files is always by probability rank first, then by internal index.
 | |
| 	* The reverse-ngram-counts script now replaces <s> with </s> and vice-versa,
 | |
| 	as required for training reverse-direction LMs, and consistent with reverse-text.
 | |
| 	* Handle comment lines starting with '##' and empty lines in nbest-rover control
 | |
| 	files the same way as in File::getline(), i.e., ignore them.
 | |
| 	* Fixed the syntax for the nbest-optimize -dynamic-random-series option (now
 | |
| 	starts with single dash, as described in man page).
 | |
| 	* Don't let compute-best-mix complain about word mismatches if <unk> is involved.
 | |
| 	* Cast input to isspace() to (unsigned char) to guarantee input is non-negative.
 | |
| 	* Fixed memory management problems in MEModel.
 | |
| 	* Work around a bug in zlib's gzprintf() printing of very long %s arguments; was
 | |
| 	causing long word strings not to be output into .gz files.
 | |
| 	* Removed word string length limit.
 | |
| 	* Removed limit on total line length in outputting ngram count files.
 | |
| 	* Zlib updated to version 1.2.11.
 | |
| 	* nbest-posteriors ensures that bytelog scores are output in fixed-point format.
 | |
| 	* Allow floating point values when parsing bytelog scores in nbest lists.
 | |
| 	* More robustness to word sausage input files that have missing data for some
 | |
| 	position.
 | |
| 	* Fixed a performance bug when nbest-rover is invoked with -output-ctm option.
 | |
| 
 | |
| $Date: 2019/09/09 23:09:32 $
 | |
| 
 | |
| 
 | 
