competition update

This commit is contained in:
nckcard
2025-07-02 12:18:09 -07:00
parent 9e17716a4a
commit 77dbcf868f
2615 changed files with 1648116 additions and 125 deletions


@@ -0,0 +1,12 @@
This file has been superseded by the "srilm-faq" manual page.
View with
man -M$SRILM/man srilm-faq
or online at
http://www.speech.sri.com/projects/srilm/manpages/srilm-faq.html
Please send corrections and additions to the author.


@@ -0,0 +1,21 @@
Conversion between Decipher PFSG (probabilistic finite state grammars) and
AT&T FSA (finite-state acceptor) formats
1. Convert PFSG to FSA format
pfsg-to-fsm symbolfile=foo.syms foo.pfsg > foo.fsm
2. Compile FSA
fsmcompile foo.fsm > foo.fsmc
3. Operate on FSA, e.g., determinize and minimize
fsmdeterminize foo.fsmc | fsmminimize > foo.min.fsmc
4. Print and convert back to PFSG
fsmprint -i foo.syms foo.min.fsmc | \
fsm-to-pfsg > foo.min.pfsg


@@ -0,0 +1,42 @@
The default Linux Makefile (common/Makefile.machine.i686) uses /usr/bin/gawk
as the gawk interpreter. To ensure correct operation in UTF-8 locales, run
/usr/bin/gawk --version and make sure it is at least version 3.1.5.
If not, it is recommended to install the latest version of gawk in
/usr/local/bin and update the GAWK variable accordingly.
-----------------------------------------------------------------------------
From: Alex Franz <alex@google.com>
Date: Fri, 06 Oct 2000 16:14:03 PDT
And, here are the details of the malloc problem that I had with the SRI
LM toolkit:
I compiled it with gcc under Redhat Linux V. 6.2 (or thereabouts).
The malloc routine has problems allocating large numbers of
small pieces of memory. For me, it usually refuses to allocate
any more memory once it has allocated about 600 MBytes of memory,
even though the machine has 2 GBytes of real memory.
This causes a problem when you are trying to build language models
with large vocabularies. Even though I used -DUSE_SARRAY_TRIE -DUSE_SARRAY
to use arrays instead of hash tables, it ran out of memory when I was
trying to use very large vocabulary sizes.
The solution that worked for me was to use Wolfram Gloger's ptmalloc package
for memory management instead. You can download it from
http://malloc.de/en/index.html
(The page suggests that it is part of the Gnu C library, but I had to
compile it myself and explicitly link it with the executables.)
One more thing you can do is call the function
mallopt(M_MMAP_MAX, n);
with a sufficiently large n; this tells malloc to allow you to
obtain a large amount of memory.
------------------------------------------------------------------------------
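As a rough sketch of the mallopt() suggestion above (glibc's mallopt() is declared in <malloc.h>; the value used below is only an illustration, not a recommendation from the message):

    #include <malloc.h>     /* mallopt(), M_MMAP_MAX (glibc) */
    #include <stdio.h>

    int main(void)
    {
        /* Raise the limit on mmap-based allocations before building
           the language model; the value is an arbitrary example. */
        if (mallopt(M_MMAP_MAX, 1000000) == 0) {
            fprintf(stderr, "mallopt(M_MMAP_MAX) failed\n");
        }
        /* ... train or load the model here ... */
        return 0;
    }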


@@ -0,0 +1,50 @@
SRILM version 1.3 and higher has been successfully built and tested using
the CYGWIN environment (http://www.cygwin.com) as of Feb 11, 2002.
The test used CYGWIN DLL 1.3.9 and gcc 2.95.3 (but note the warning
in doc/README.x86 regarding a compiler bug), and ran successfully
on a Windows 98 and a Windows 2000 Professional system.
The following special measures were taken to ensure a successful build:
- Make sure the make, gcc, binutils, libiconv, gzip, tcltk, and gawk packages
are installed with CYGWIN. To run the tests you will also need the
diffutils and time packages.
After installation, set your bash environment as follows
export SRILM=/cygdrive/c/srilm13 # or similar
# do NOT use backslash in path names SRILM=C:\...
export MACHINE_TYPE=cygwin
export PATH=$PATH:$SRILM/bin:$SRILM/bin/cygwin # mentioned in INSTALL
export MANPATH=$MANPATH:$SRILM/man # mentioned in INSTALL
or the equivalent for other shells.
As of version 1.4.5, SRILM can also be built in the MinGW environment
(http://www.mingw.org). For this the default (cygwin) has to be overridden
using
make MACHINE_TYPE=win32
For 64bit binaries use
make MACHINE_TYPE=win64
Of course the corresponding versions of the MinGW development environment
(C, C++, binutils) have to be installed in Cygwin. Make sure the Cygwin
installation is generally up-to-date.
It may be necessary to include the following directories in the PATH
environment variable for the runtime dynamic libraries to be found:
/usr/i686-pc-mingw32/sys-root/mingw/bin (win32)
/usr/x86_64-w64-mingw32/sys-root/mingw/bin (win64)
Some functionality is not supported under MinGW:
- compressed file I/O
- nbest-optimize and lattice-tool -max-time option
A. Stolcke
$Id: README.windows-cygwin,v 1.10 2013/01/31 18:03:33 stolcke Exp $


@@ -0,0 +1,63 @@
Recommendations for compiling with Microsoft Visual C++
The build procedure has been tested with the freely available
Visual C++ 8 that can be downloaded from www.microsoft.com as
"Visual C++ 2005 Express Edition".
0) Install the cygwin environment, as described in README.windows-cygwin .
Cygwin tools are needed to run the build process and generate program
dependencies.
1) The SRILM variable can be set to the cygwin path of the SRILM root directory
(e.g., /home/username/srilm)
2) Make sure environment variables are set to locate MSVC tools and files:
PATH should include MSVC_INSTALL_DIR/bin and
MSVC_INSTALL_DIR/Common7/IDE (for dll search)
MSVCDIR should be set to MSVC_INSTALL_DIR
INCLUDE should be set to MSVC_INSTALL_DIR/include
LIB should be set to MSVC_INSTALL_DIR/lib
Note: PATH needs to use cygwin pathname conventions, but MSVCDIR,
INCLUDE and LIB must use Windows paths. For example:
PATH="/cygdrive/c/Program Files/Microsoft Visual Studio 8/VC/bin:/cygdrive/c/Program Files/Microsoft Visual Studio 8/Common7/IDE:$PATH"
MSVCDIR="c:\\Program Files\\Microsoft Visual Studio 8\\VC"
INCLUDE="$MSVCDIR\\include"
LIB="$MSVCDIR\\lib"
export PATH MSVCDIR INCLUDE LIB
could be used in bash given the default installation location of Visual
C++ 2005 Express Edition under c:\Program Files\Microsoft Visual Studio 8.
Alternatively, you could use the vcvars32.bat script that comes with
MSVC to set these environment variables.
3) Build in a cygwin shell with
make MACHINE_TYPE=msvc
or
make MACHINE_TYPE=msvc64
to generate 64bit binaries.
As with MinGW, some functionality is not supported:
- compressed file I/O other than gzip files
- nbest-optimize and lattice-tool -max-time option
Also note that make will try to determine if certain libraries
are installed on your system and enable the /openmp option if so.
This means that binaries built with the full Visual Studio compiler
might not run on systems that have only Visual Studio Express.
To avoid this, disable /openmp by commenting out the corresponding
line containing "/openmp" in common/Makefile.machine.msvc.
4) Run test suite with
cd test
make MACHINE_TYPE=msvc try


@@ -0,0 +1,71 @@
Recommendations for compiling with Microsoft Visual Studio 2005.
Microsoft Visual Studio projects are available for Microsoft Visual
Studio 2005.
The advantages of building directly in Visual Studio are:
- Build is faster.
- Integrated debugging.
- Library projects can be included as dependencies of your own projects.
The disadvantages of building directly in Visual Studio are:
- Not all parts of the SRILM build process can be completed inside Visual Studio.
- More cumbersome to build from the command line.
The build procedure has been tested with Visual Studio 2005
Professional.
0) Open SRILM/visual_studio/vs2005/srilm.sln in Visual Studio 2005.
1) Select either the "Release" or "Debug" targets.
2) Build the solution. The libraries will be built into
SRILM/lib/Release or SRILM/lib/Debug, and the executables will be built
into SRILM/bin/Release or SRILM/bin/Debug.
This procedure will NOT produce the central include directory, release
the SRILM scripts, or build the test environment.
To produce the central include directory or release the SRILM scripts,
you will have to install the cygwin environment and use the included
Makefiles as described in (3):
(3) Follow steps (0) and (1) from README.windows-msvc:
(3a) Install the cygwin environment, as described in README.windows-cygwin.
Cygwin tools are needed to run the build process and generate program
dependencies.
(3b) The SRILM variable can be set to the cygwin path of the SRILM root directory
(e.g., /home/username/srilm, /cygdrive/c/SRILM)
(3c) To produce the central include directory and release the SRILM
scripts, do either:
% make MACHINE_TYPE=msvc dirs
% make MACHINE_TYPE=msvc init
% make MACHINE_TYPE=msvc release-headers
% make MACHINE_TYPE=msvc release-scripts
% cd utils/src; make MACHINE_TYPE=msvc release
or
% make MACHINE_TYPE=msvc msvc
If you want to run the SRILM tests, you will have to follow the full
instructions in README.windows-msvc.
If you want to compile using a different version of MSVC, I suggest
creating a new subdirectory of SRILM/visual_studio, e.g.,
"SRILM/visual_studio/vs2010", and copying the contents of the "vs2005"
directory into your new directory. Then use MSVC to update your
projects. If you need to downgrade to MSVC 2003, it is possible to do
this by manually editing the project files.
I would like to thank Keith Vertanen (http://www.keithv.com) for
contributing the original versions of these Microsoft Visual Studio
projects to SRILM.
Victor Abrash
victor@speech.sri.com


@@ -0,0 +1,21 @@
Note for Intel x86 platforms:
The function KneserNey::lowerOrderWeight() seems to trigger a compiler
bug in gcc 2.95.3 with optimization, on the i686-pc-linux-gnu,
i386-pc-solaris2, and i686-pc-cygwin platforms (and therefore probably
on all Intel targets). The problem manifests itself by the
"ngram-count-kn-int" test in the test/ directory not terminating.
To work around this problem, compile lm/src/Discount.cc without global
optimization:
cd $SRILM/lm/src
rm ../obj/$MACHINE_TYPE/Discount.o
gnumake OPTIMIZE_FLAGS=-O1 ngram-count
gnumake release
As of Feb 2002, Cygwin ships with gcc 2.95.3 and therefore suffers from
this bug. gcc 2.95.2 or lower and gcc 3.x versions of the compiler
don't seem to be affected, though.


@@ -0,0 +1,9 @@
compute local entropies, stddev in perplexity output
implement TaggedNgram backoff-weight computation
fix some cases of back-to-back DFs
FP-REP2
general mechanism to skip words in ngram contexts, handle non-event tags.


@@ -0,0 +1,47 @@
C++ porting notes
-----------------
I originally wrote this code using gcc/g++ as the compiler.
Below is a list of changes I had to make to accommodate SGI, Sun and Intel C++
compilers.
o Explicitly instantiate templates only in g++ (#ifdef __GNUG__).
o Avoid global static data referenced by template code. These symbols
become undefined in the instantiated template functions.
I use local static data, or global extern data instead (the latter
#ifndef __GNUG__).
o Avoid non-static inline functions in templates (.cc files). They
don't get properly instantiated by Sun CC and result in undefined
symbols.
o Use Sun CC -xar option to build libraries. This includes needed
template instances in the library (see the new $(ARCHIVE) macro to
refer to the appropriate archive command).
o To work around an internal compiler error in gcc 2.7.2, I had to
add empty destructors in a few classes. However, I no longer use gcc 2.7.2
in testing, so current versions are no longer guaranteed to work with it.
o Intel C++ doesn't support arrays of non-POD objects on the stack. I had
to replace those with matching calls to new/delete.
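A made-up minimal example of the last point (the class and function names are invented for illustration and do not appear in the toolkit):

    class Token {                    // non-POD: user-defined constructor/destructor
    public:
        Token() : count(0) { }
        ~Token() { }
        unsigned count;
    };

    void processTokens(unsigned n)
    {
        // gcc accepts this as an extension, but Intel C++ rejects it
        // (variable-length array of non-POD objects on the stack):
        //     Token toks[n];

        // portable replacement, with matching new/delete:
        Token *toks = new Token[n];
        // ... use toks[0] .. toks[n-1] ...
        delete [] toks;
    }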
Compilers that work
-------------------
(as of last checking, meaning current versions may not work out of the box)
+ gcc/g++ 2.8.1, 2.95.3, 3.2 and 3.3
+ Sun SC4.0
+ SGI NCC
+ Intel C++ 7.1
Compilers that don't work on this code
--------------------------------------
- Sun SC3.0.1
- CenterLine (Sun version, as of 6/96)


@@ -0,0 +1,184 @@
SRI Object-oriented Language Modeling Libraries and Tools
BUILD AND INSTALL
See the INSTALL file in the top-level directory.
FILES
All mixed-case .cc and .h files contain C++ code for the liboolm
library.
All lower-case .cc files in this directory correspond to executable commands.
Each program is option driven. The -help option prints a list of
available options and their meaning.
N-GRAM MODELS
Here we only cover arbitrary-order n-grams (no classes yet, sorry).
ngram-count manipulates n-gram counts and
estimates backoff models
ngram-merge merges count files (only needed for large
corpora/vocabularies)
ngram computes probabilities given a backoff model
(also does mixing of backoff models)
Below are some typical command lines.
NGRAM COUNT MANIPULATION
ngram-count -order 4 -text corpus -write corpus.ngrams
Counts all n-grams up to length 4 in corpus and writes them to
corpus.ngrams. The default n-gram order (if not specified) is 3.
ngram-count -vocab 30k.vocab -order 4 -text corpus -write corpus.ngrams
Same, but restricts the vocabulary to what is listed in 30k.vocab
(one word per line). Other words are replaced with the <unk> token.
ngram-count -read corpus.ngrams -vocab 20k.vocab -write-order 3 \
-write corpus.20k.3grams
Reads the counts from corpus.ngrams, maps them to the indicated
vocabulary, and writes only the trigrams back to a file.
ngram-count -text corpus -write1 corpus.1grams -write2 corpus.2grams \
-write3 corpus.3grams
Writes unigrams, bigrams, and trigrams to separate files, in one
pass. Usually there is no reason to keep these separate, which is
why by default ngrams of all lengths are written out together.
The -recompute flag will regenerate the lower-order counts from the
highest order one by summation by prefixes.
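For example, the bigram count c(a b) is regenerated as the sum of the
trigram counts c(a b w) over all words w that follow the bigram.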
The -read and -text options are additive, and can be used to merge
new counts with old. Furthermore, repeated n-grams read with -read
are also additive. Thus,
cat counts1 counts2 ... | ngram-count -read -
will merge the counts from counts1, counts2, ...
All file reading and writing uses the zio routines, so argument "-"
stands for stdin/stdout, and .Z and .gz files are handled correctly.
For very large count files (due to large corpus/vocabulary) this
method of merging counts in-memory is not suitable. Alternatively,
counts can be sorted and then merged.
First, generate counts for portions of the training corpus and
save them with -sort:
ngram-count -text part1.text -sort -write part1.ngrams.Z
ngram-count -text part2.text -sort -write part2.ngrams.Z
Then combine these with
ngram-merge part?.ngrams.Z > all.ngrams.Z
(Although ngram-merge can deal with any number of files >= 2,
it is most efficient to combine two parts at a time, then combine
the resulting count files, again two at a time, following a
binary tree merging scheme.)
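For example, with four count files one would merge part1 with part2 and
part3 with part4 in a first pass, then merge the two resulting files in
a second pass.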
BACKOFF MODEL ESTIMATION
ngram-count -order 2 -read corpus.counts -lm corpus.bo
generates a bigram backoff model in ARPA (aka Doug Paul) format
from corpus.counts and writes it to corpus.bo.
If counts fit into memory (and hence there is no reason for the
merging schemes described above) it is more convenient and faster
to go directly from training text to model:
ngram-count -text corpus -lm corpus.bo
The built-in discounting method used in building backoff models
is Good Turing. The lower exclusion cutoffs can be set with
options (-gt1min ... -gt6min), the upper discounting cutoffs are
selected with -gt1max ... -gt6max. Reasonable defaults are
provided that can be displayed as part of the -help output.
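For reference (the formula below is the standard one and is not spelled
out in this file), Good-Turing replaces a count r below the upper cutoff
by the discounted value

    r* = (r+1) * n(r+1) / n(r)

where n(r) is the number of distinct n-grams occurring exactly r times.
The estimate thus depends directly on the count-of-count statistics n(r),
which is what the vocabulary issue discussed next can distort.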
When using limited vocabularies it is recommended to compute the
discount coefficients on the unlimited vocabulary (at least for
the unigrams) and then apply them to the limited vocabulary
(otherwise the vocabulary truncation would produce badly skewed
count frequencies at the low end that would break the GT algorithm).
For this reason, discounting parameters can be saved to files and
read back in.
For example
ngram-count -text corpus -gt1 gt1.params -gt2 gt2.params \
-gt3 gt3.params
saves the discounting parameters for unigrams, bigrams and trigrams
in the files as indicated. These are short files that can be
edited, e.g., to adjust the lower and upper discounting cutoffs.
Then the limited vocabulary backoff model is estimated using these
saved parameters
ngram-count -text corpus -vocab 20k.vocab \
-gt1 gt1.params -gt2 gt2.params -gt3 gt3.params -lm corpus.20k.bo
MODEL EVALUATION
The ngram program uses a backoff model to compute probabilities and
perplexity on test data.
ngram -lm some.bo -ppl test.corpus
computes the perplexity on test.corpus according to model some.bo.
The flag -debug controls the amount of information output:
-debug 0 only overall statistics
-debug 1 statistics for each test sentence
-debug 2 probabilities for each word
-debug 3 verify that word probabilities over the
entire vocabulary sum to 1 for each context
ngram also understands the -order flag to set the maximum ngram
order effectively used by the model. The default is 3.
It has to be explicitly reset to use ngrams of higher order, even
if the file specified with -lm contains higher order ngrams.
The flag -skipoovs establishes compatibility with broken behavior
in some old software. It should only be used with bo model files
produced with the old tools. It will
- let OOVs be counted as such even when the model has a probability
for <unk>
- skip not just the OOV but the entire n-gram context in which any
OOVs occur (instead of backing off on OOV contexts).
OTHER MODEL OPERATIONS
ngram performs a few other operations on backoff models.
ngram -lm bo1 -mix-lm bo2 -lambda 0.2 -write-lm bo3
produces a new model in bo3 that is the interpolation of bo1 and bo2
with a weight of 0.2 (for bo1).
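In other words (roughly, since the result is folded back into a single
backoff model with recomputed backoff weights), for n-grams explicitly
listed in the models

    p_bo3(w | context) = 0.2 * p_bo1(w | context) + 0.8 * p_bo2(w | context)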
ngram -lm bo -renorm -write-lm bo.new
recomputes the backoff weights in the model bo (thus normalizing
probabilities to 1) and leaves the result in bo.new.
API FOR LANGUAGE MODELS
These programs are just examples of how to use the object-oriented
language model library currently under construction. To use the API
one would have to read the various .h files and see how the interfaces
are used in the example programs. No comprehensive documentation is
available as yet. Sorry.
AVAILABILITY
This code is Copyright SRI International, but is available free of
charge for non-profit use. See the License file in the top-level
directory for the terms of use.
Andreas Stolcke
$Date: 1999/07/31 18:48:33 $


@@ -0,0 +1,12 @@
size of ngram reading Switchboard trigram

    SunOS5.5.1   libc malloc                 12M
                 -lmalloc                    12M
    IRIX5.1      libc malloc                 12M
                 -lmalloc   MX_FAST = 24     11M
                            MX_FAST = 10     12M
                            MX_FAST = 32     11M


@@ -0,0 +1,115 @@
NEW SRI-LM LIBRARY AND TOOLS -- OVERVIEW
Design Goals
- coverage of state-of-the-art LM methods
- extensible vehicle for LM research
- code reusability
- tool versatility
- speed
Implementation language: C++ (GNU compiler)
LM CLASS HIERARCHY
LM
Ngram -- arbitrary-order N-gram backoff models
DFNgram -- N-grams including disfluency model
VarNgram -- variable-order N-grams
TaggedNgram -- word/tag N-grams
CacheLM -- unigram from recent history
DynamicLM -- changes as a function of external info
BayesMix -- mixture of LMs with contextual adaptation
OTHER CLASSES
Vocab -- word string/index mapping
TaggedVocab -- same for word/tag pairs
LMStats -- statistics for LM estimation
NgramStats -- N-gram counts
TaggedNgramStats -- word/tag N-gram counts
Discount -- backoff probability discounting
GoodTuring -- standard
ConstDiscount -- Ney's method
NaturalDiscount -- Ristad's method
HELPER LIBRARIES
libdstruct -- template data structures
Array -- self-extending arrays
Map
SArray -- sorted arrays
LHash -- linear hash tables
Trie -- index trees (based on a Map type)
MemStats -- memory usage tracking
libmisc -- convenience functions:
option parsing,
compressed file i/o,
object debugging
TOOLS
ngram-count -- N-gram counting and model estimation
ngram-merge -- N-gram count merging
ngram -- N-gram model scoring, perplexity,
sentence generation, mixing and
interpolation
LM INTERFACE
LogP wordProb(VocabIndex word, const VocabIndex *context)
LogP wordProb(VocabString word, const VocabString *context)
LogP sentenceProb(const VocabIndex *sentence, TextStats &stats)
LogP sentenceProb(const VocabString *sentence, TextStats &stats)
unsigned pplFile(File &file, TextStats &stats, const char *escapeString = 0)
setState(const char *state);
wordProbSum(const VocabIndex *context)
VocabIndex generateWord(const VocabIndex *context)
VocabIndex *generateSentence(unsigned maxWords, VocabIndex *sentence)
VocabString *generateSentence(unsigned maxWords, VocabString *sentence)
Boolean isNonWord(VocabIndex word)
Boolean read(File &file);
void write(File &file);
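A minimal usage sketch of this interface (the Ngram(vocab, order) and
File(name, mode) constructors and the header names are inferred from the
class names above rather than documented here):

    #include "File.h"
    #include "Vocab.h"
    #include "TextStats.h"
    #include "Ngram.h"

    int main()
    {
        Vocab vocab;
        Ngram lm(vocab, 3);                 // assumed: vocabulary + maximum order

        File lmFile("corpus.bo", "r");      // assumed: file name + stdio-style mode
        if (!lm.read(lmFile)) {
            return 1;                       // model could not be read
        }

        File testFile("test.corpus", "r");
        TextStats stats;
        lm.pplFile(testFile, stats);        // accumulate totals, as in "ngram -ppl"
        // stats now holds the statistics from which perplexity is computed
        return 0;
    }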
EXTENSIBILITY/REUSABILITY
THINGS TO DO
- Node array interface
- General interpolated LMs
- LM "shell" for interactive model manipulation and use (Tcl based)


@@ -0,0 +1,22 @@
Standard, speed-optimized libraries (using hash tables)
bin/i386-solaris/ngram -debug 1 -memuse -lm $EVAL2000/data/lms/devel2000/swbd+ch_en+h4eval2000-m-pruned.3bo.gz -write-vocab /dev/null
reading 34610 1-grams
reading 4826134 2-grams
reading 9733194 3-grams
total memory 276788424 (263.966M), used 170254952 (162.368M), wasted 106533472 (101.598M)
Time elapsed 67.06 user 70.26 system 5.90 utilization 113.57%
"compact", space-optimized libraries (using sorted arrays)
bin/i386-solaris_c/ngram -debug 1 -memuse -lm $EVAL2000/data/lms/devel2000/swbd+ch_en+h4eval2000-m-pruned.3bo.gz -write-vocab /dev/null
reading 34610 1-grams
reading 4826134 2-grams
reading 9733194 3-grams
total memory 175131744 (167.019M), used 170254952 (162.368M), wasted 4876792 (4.65087M)
Time elapsed 75.17 user 81.84 system 4.47 utilization 114.83%
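In other words, the compact build wastes about 2.8% of its total allocation
versus about 38% for the standard build, at the cost of roughly 12% longer
elapsed time for loading the model (75.17 vs. 67.06 seconds).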