b2txt25/language_model/srilm-1.7.3/doc/overview


NEW SRI-LM LIBRARY AND TOOLS -- OVERVIEW

Design Goals

- coverage of state-of-the-art LM methods
- extensible vehicle for LM research
- code reusability
- tool versatility
- speed

Implementation language: C++ (GNU compiler)


LM CLASS HIERARCHY

   LM
     Ngram			-- arbitrary-order N-gram backoff models
	  DFNgram		-- N-grams including disfluency model
	  VarNgram		-- variable-order N-grams
          TaggedNgram		-- word/tag N-grams
     CacheLM			-- unigram from recent history
     DynamicLM			-- changes as a function of external info
     BayesMix			-- mixture of LMs with contextual adaptation


OTHER CLASSES

   Vocab			-- word string/index mapping
      TaggedVocab		-- same for word/tag pairs

   LMStats			-- statistics for LM estimation
      NgramStats		-- N-gram counts
          TaggedNgramStats	-- word/tag N-gram counts

   Discount			-- backoff probality discounting
      GoodTuring		-- standard
      ConstDiscount		-- Ney's method
      NaturalDiscount		-- Ristad's method


HELPER LIBRARIES

   libdstruct			-- template data structures

      Array			-- self-extending arrays

      Map
         SArray			-- sorted arrays
         LHash			-- linear hash tables

      Trie			-- index trees (based on a Map type)

      MemStats			-- memory usage tracking


   libmisc			-- convenience functions:
					option parsing,
					compressed file i/o,
					object debugging


TOOLS

   ngram-count			-- N-gram counting and model estimation
   ngram-merge			-- N-gram count merging
   ngram			-- N-gram model scoring, perplexity,
				   sentence generation, mixing and
				   interpolation


LM INTERFACE

LogP wordProb(VocabIndex word, const VocabIndex *context)
LogP wordProb(VocabString word, const VocabString *context)

LogP sentenceProb(const VocabIndex *sentence, TextStats &stats)
LogP sentenceProb(const VocabString *sentence, TextStats &stats)

unsigned pplFile(File &file, TextStats &stats, const char *escapeString = 0)
setState(const char *state);

wordProbSum(const VocabIndex *context)

VocabIndex generateWord(const VocabIndex *context)
VocabIndex *generateSentence(unsigned maxWords,  VocabIndex *sentence)
VocabString *generateSentence(unsigned maxWords, VocabString *sentence)

Boolean isNonWord(VocabIndex word)
Boolean read(File &file);
void write(File &file);


EXTENSIBILITY/REUSABILITY


THINGS TO DO

- Node array interface
- General interpolated LMs
- LM "shell" for interactive model manipulation and use (Tcl based)