Is there a tutorial about GIZA++?

Posted on 2024-11-02 19:48:47

江南月 2024-11-09 19:48:47

The following is excerpted from a tutorial I'm putting together for a class. (NB: This assumes you have successfully installed GIZA++-v2 on a *nix system.)

Start with two data files containing parallel sentences that have been tokenized, one sentence per line. For example, a pair of parallel English-French files might read as follows.

Sample 1 - train.en

I gave him the book . 
He read the book . 
He loved the book .

Sample 2 - train.fr

Je lui ai donné le livre .
Il a lu le livre .
Il aimait le livre .
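
Before going further, it can help to confirm that the two files really are parallel. The following is a minimal sketch of my own, not part of GIZA++; the function name check_parallel is hypothetical, and it only checks the two properties stated above (one sentence per line, same number of lines in both files).

```python
from itertools import zip_longest


def check_parallel(src_lines, tgt_lines):
    """Return a list of problems found in two parallel, tokenized corpora."""
    problems = []
    pairs = zip_longest(src_lines, tgt_lines)
    for n, (src, tgt) in enumerate(pairs, start=1):
        # zip_longest pads the shorter input with None.
        if src is None or tgt is None:
            problems.append(f"line {n}: one file is shorter than the other")
            break
        if not src.strip() or not tgt.strip():
            problems.append(f"line {n}: empty sentence")
    return problems


# Two of the sample sentence pairs from above, held in memory for illustration.
train_en = ["He read the book .", "He loved the book ."]
train_fr = ["Il a lu le livre .", "Il aimait le livre ."]
print(check_parallel(train_en, train_fr))  # prints []
```

For real data you would pass the lines read from train.en and train.fr instead of the in-memory lists.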

Run these files through plain2snt.out to get source and target vocabulary files (*.vcb) as well as sentence-pair files (*.snt).

From the GIZA++ directory, run:

./plain2snt.out TEXT1 TEXT2

where TEXT1 and TEXT2 are the two parallel data files described above.

This produces four files in the same directory as TEXT1 and TEXT2 (assuming they are in the same directory):

TEXT1_TEXT2.snt
TEXT1.vcb
TEXT2_TEXT1.snt
TEXT2.vcb

Each line of a vocab file contains a unique integer ID for a word in the text, the word/string itself exactly as it appears in the input (NB: it is not tokenized or lemmatized any further), and the number of times that string occurred. These three fields are separated by a single space character.
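
To make the vocab format concrete, here is a small sketch that reads lines of the form "ID WORD COUNT" into a lookup table. The names parse_vcb and VocabEntry are my own, and the IDs in the sample are made up for illustration; the actual numbers that plain2snt.out assigns may differ. The counts match the train.en sample above.

```python
from typing import NamedTuple


class VocabEntry(NamedTuple):
    word_id: int
    word: str
    count: int


def parse_vcb(lines):
    """Parse 'ID WORD COUNT' lines into a dict keyed by integer ID."""
    vocab = {}
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        word_id, word, count = line.split()
        vocab[int(word_id)] = VocabEntry(int(word_id), word, int(count))
    return vocab


# Hypothetical contents of a vocab file for train.en (IDs are illustrative).
sample_vcb = [
    "2 I 1", "3 gave 1", "4 him 1", "5 the 3", "6 book 3",
    "7 . 3", "8 He 2", "9 read 1", "10 loved 1",
]
vocab = parse_vcb(sample_vcb)
print(vocab[5].word, vocab[5].count)  # prints: the 3
```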

The sentence files contain only numbers. For each sentence pair, there are three lines: the first is a count of the number of times that sentence pair occurs in the corpus, and the second and third are strings of (space-separated) numbers corresponding to the entries for words in the vocab files.

Based on the naming convention for *.snt files, the first file in the name is assumed to be the source language, and the second is assumed to be the target language. For example, in the file TEXT1_TEXT2.snt, the first line will be a count of the number of times the first sentence pair occurred in the corpus, the second line will be a string of numbers corresponding to words in the TEXT1.vcb file, and the third line will be a string of numbers corresponding to words in the TEXT2.vcb file.
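
The three-line record layout can be read back like this. This is only a sketch under the description above; parse_snt is a hypothetical helper of mine, and the IDs and the two small vocab tables are invented for the example rather than taken from real plain2snt.out output.

```python
def parse_snt(lines):
    """Yield (count, source_ids, target_ids) for each three-line record."""
    rows = [line.split() for line in lines if line.strip()]
    if len(rows) % 3 != 0:
        raise ValueError("expected three lines per sentence pair")
    for i in range(0, len(rows), 3):
        count, src, tgt = rows[i], rows[i + 1], rows[i + 2]
        yield int(count[0]), [int(x) for x in src], [int(x) for x in tgt]


# Illustrative ID-to-word tables (made-up IDs) and one sentence-pair record.
src_vocab = {2: "He", 3: "read", 4: "the", 5: "book", 6: "."}
tgt_vocab = {2: "Il", 3: "a", 4: "lu", 5: "le", 6: "livre", 7: "."}
snt_lines = ["1", "2 3 4 5 6", "2 3 4 5 6 7"]

for count, src_ids, tgt_ids in parse_snt(snt_lines):
    src_text = " ".join(src_vocab[i] for i in src_ids)
    tgt_text = " ".join(tgt_vocab[i] for i in tgt_ids)
    print(count, src_text, "|||", tgt_text)
# prints: 1 He read the book . ||| Il a lu le livre .
```

Mapping the numbers back to words this way is a quick check that the .snt file and the two .vcb files belong together.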

Now TEXT1.vcb, TEXT2.vcb, and either of the two *.snt files can be used as input to GIZA++ to produce an alignment.

For example:

./GIZA++ -s TEXT1.vcb -t TEXT2.vcb -c TEXT1_TEXT2.snt

But note that when I tried to run this, I had to rename TEXT1_TEXT2.snt to something without an underscore in the name in order to get any proper output.
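
If you want to script the rename workaround, something like the following might do. It is a sketch only: the helper names are mine, the -s, -t and -c flags are just the ones from the command above, and I am assuming that copying the file to a name without the underscore has the same effect as renaming it.

```python
import shutil
from pathlib import Path


def underscore_free(path):
    """Return the same path with underscores removed from the file name."""
    path = Path(path)
    return path.with_name(path.name.replace("_", ""))


def prepare_corpus(snt_path):
    """Copy the .snt file to an underscore-free name and return the new path."""
    src = Path(snt_path)
    dst = underscore_free(src)
    if dst != src:
        shutil.copyfile(src, dst)
    return dst


def giza_command(src_vcb, tgt_vcb, corpus, giza_bin="./GIZA++"):
    """Build the argument list for the GIZA++ call shown above."""
    return [giza_bin, "-s", str(src_vcb), "-t", str(tgt_vcb), "-c", str(corpus)]


corpus_name = underscore_free("TEXT1_TEXT2.snt")
print(giza_command("TEXT1.vcb", "TEXT2.vcb", corpus_name))
```

To actually run it, call prepare_corpus("TEXT1_TEXT2.snt") from the GIZA++ directory and pass the resulting list to subprocess.run(..., check=True).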

夜司空 2024-11-09 19:48:47

This PowerPoint tutorial worked for me: http://www.tc.umn.edu/~bthomson/wordalignment/GIZA.ppt

夢归不見 2024-11-09 19:48:47

This was very helpful:
http://fabioticconi.wordpress.com/2011/01/17/how-to-do-a-word-alignment-with-giza-or-mgiza-from-parallel-corpus/

Scholars at IIT-B have put together some nice, detailed presentations on setting up and using GIZA++ and MOSES.

Some of them are:
http://www.cse.iitb.ac.in/~pb/cs712-2013/potpouri/kashyap-giza-mozes-jan2013.pdf

http://www.cse.iitb.ac.in/~anoopk/publications/presentations/moses_giza_intro.pdf

http://www.cfilt.iitb.ac.in/Moses-Tutorial.pdf

鸠书 2024-11-09 19:48:47

Maybe this?

http://code.google.com/p/giza-pp/issues/attachmentText?id=8&aid=697742396599277757&name=README-rst&token=40fba3d449abc12366b98b04cfe7dbc1

Full source: http://code.google.com/p/giza-pp/issues/detail?id=8

故事灯 2024-11-09 19:48:47

Here are some additional notes on how to format the input files and how to run GIZA++:

http://www.tc.umn.edu/~bthomson/wordalignment/GIZAREADME.txt
