Questions asking us to recommend or find a book, tool, software library, tutorial or other off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.
Closed 10 years ago.
The following is excerpted from a tutorial I'm putting together for a class. (NB: This assumes you have successfully installed GIZA++-v2 on a *nix system.)
Sample 1 - train.en
Sample 2 - train.fr
Run plain2snt.out on these files to get target and source vocabulary files (*.vcb) as well as a sentence pair file (*.snt). From the GIZA++ directory, run it on TEXT1 and TEXT2, where TEXT1 and TEXT2 are the data files described in step 1.
This produces four files in the same directory as TEXT1 and TEXT2 (assuming they are in the same directory): two vocabulary files and two sentence files.
The vocab files contain a unique (integer) ID for each word in the text (NB: not tokenized/lemmatized), the word/string, and the number of times that string occurred. These are separated by a single space character.
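To make the vocab layout concrete, here is a minimal Python sketch that reads lines in the "ID word count" format described above. The function name parse_vcb and the sample lines are made up for illustration; they are not real plain2snt.out output.

```python
from typing import Dict, Iterable, Tuple


def parse_vcb(lines: Iterable[str]) -> Dict[int, Tuple[str, int]]:
    """Map each word ID to (word, count) from the lines of a *.vcb file."""
    vocab: Dict[int, Tuple[str, int]] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        # Three fields separated by a single space: ID, word, count.
        word_id, word, count = line.split(" ")
        vocab[int(word_id)] = (word, int(count))
    return vocab


# Hypothetical sample content, chosen only to show the format.
sample_vcb = ["2 the 3", "3 house 1", "4 is 2"]
vocab = parse_vcb(sample_vcb)
print(vocab[3])  # ('house', 1)
```

For a file on disk, the same function can be fed an open file handle, e.g. parse_vcb(open("TEXT1.vcb", encoding="utf-8")).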
The sentence files contain numbers. For each sentence pair, there are three lines: the first is a count of the number of times that sentence pair occurs in the corpus, and the second and third are strings of (space-separated) numbers corresponding to the entries for words in the vocab files. Based on the naming convention for *.snt files, the first file is assumed to be the source, and the second is assumed to be the target language. For example, in the file TEXT1_TEXT2.snt, the first line will be a count of the number of times the first sentence pair occurred in the corpus, the second line will be a string of numbers corresponding to words in the TEXT1.vcb file, and the third line will be a string of numbers corresponding to words in the TEXT2.vcb file.
TEXT1.vcb, TEXT2.vcb, and either of the two *.snt files can be used as input to GIZA++ to produce an alignment.
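As a rough sketch of the two invocations, the Python below only builds the argument lists. It assumes that plain2snt.out takes the two corpus files as positional arguments and that GIZA++ accepts -S (source vocabulary), -T (target vocabulary) and -C (sentence file); check the README of your build before relying on this. The helper names are hypothetical.

```python
from pathlib import Path
from typing import List


def plain2snt_command(text1: str, text2: str) -> List[str]:
    # Assumption: the two corpus files are passed as positional arguments.
    return ["./plain2snt.out", text1, text2]


def giza_command(source_vcb: str, target_vcb: str, snt: str) -> List[str]:
    # Assumption: -S = source vocab, -T = target vocab, -C = sentence file.
    return ["./GIZA++", "-S", source_vcb, "-T", target_vcb, "-C", snt]


def underscore_free_name(snt_path: str) -> str:
    """Return the same path with underscores removed from the file name."""
    path = Path(snt_path)
    return str(path.with_name(path.name.replace("_", "")))


print(" ".join(plain2snt_command("TEXT1", "TEXT2")))
snt = underscore_free_name("TEXT1_TEXT2.snt")
print(" ".join(giza_command("TEXT1.vcb", "TEXT2.vcb", snt)))
```

Each list can be passed to subprocess.run(cmd, cwd=giza_dir, check=True), where giza_dir is wherever the binaries were built. The sentence file has to be copied or renamed on disk to match the underscore-free name before GIZA++ is started.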
But note that when I tried to run this, I had to rename TEXT1_TEXT2.snt to something without an underscore in the name in order to get any proper output.
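The three-line record layout of the *.snt files described above can be checked by decoding a sentence file back into words. This is a minimal sketch: read_snt and the tiny vocabularies are made up for illustration, and it assumes no sentence in the corpus is empty.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def read_snt(
    lines: Iterable[str],
    source_words: Dict[int, str],
    target_words: Dict[int, str],
) -> Iterator[Tuple[int, List[str], List[str]]]:
    """Yield (count, source words, target words) for each sentence pair."""
    rows = [line.strip() for line in lines if line.strip()]
    if len(rows) % 3 != 0:
        raise ValueError("expected three lines per sentence pair")
    for i in range(0, len(rows), 3):
        count = int(rows[i])  # line 1: how often the pair occurs
        source = [source_words[int(n)] for n in rows[i + 1].split()]  # line 2
        target = [target_words[int(n)] for n in rows[i + 2].split()]  # line 3
        yield count, source, target


# Hypothetical ID-to-word maps, as they might be read from the two *.vcb files.
source_words = {2: "the", 3: "house"}
target_words = {2: "la", 3: "maison"}
sample_snt = ["1", "2 3", "2 3"]

pairs = list(read_snt(sample_snt, source_words, target_words))
print(pairs)  # [(1, ['the', 'house'], ['la', 'maison'])]
```

With TEXT1_TEXT2.snt, the source map comes from TEXT1.vcb and the target map from TEXT2.vcb; for the other sentence file the roles are swapped.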
This Powerpoint tutorial worked for me: http://www.tc.umn.edu/~bthomson/wordalignment/GIZA.ppt
This one is very helpful:
http://fabioticconi.wordpress.com/2011/01/17/how-to-do-a-word-alignment-with-giza-or-mgiza-from-parallel-corpus/
IIT-B scholars have put up nice and detailed presentations for GIZA++ and MOSES setup and use.
Some of them are:
http://www.cse.iitb.ac.in/~pb/cs712-2013/potpouri/kashyap-giza-mozes-jan2013.pdf
http://www.cse.iitb.ac.in/~anoopk/publications/presentations/moses_giza_intro.pdf
http://www.cfilt.iitb.ac.in/Moses-Tutorial.pdf
Maybe this one?
http://code.google.com/p/giza-pp/issues/attachmentText?id=8&aid=697742396599277757&name=README-rst&token=40fba3d449abc12366b98b04cfe7dbc1
Full source: http://code.google.com/p/giza-pp/issues/detail?id=8
There is a supplemental explanation of how to format input files and how to run GIZA++ over here:
http://www.tc.umn.edu/~bthomson/wordalignment/GIZAREADME.txt