Natural Language Processing: Word Alignment

Posted on 2024-08-25 06:55:39

I am looking for word alignment tools and algorithms.
I am working with bilingual English-Hindi text and am currently experimenting with:

  • DTW (Dynamic Time Warping) algorithm
  • CLA (Competitive Linking Algorithm)
  • NATools
  • Giza++

Could you please suggest any other language-independent algorithm or tool that can perform statistical word alignment on a parallel English-Hindi corpus, along with a way to evaluate the result?
I have heard that some tools work best for certain languages. How true is that? If it is true, could you give an example of a tool that is better suited to Asian languages like Hindi? Counter-examples of tools I shouldn't use for such languages are also welcome.

I have heard a bit about the Uplug word aligner. Could someone tell me whether this tool would be useful for my purpose?

Thank you. :)

Comments (4)

野稚 2024-09-01 06:55:39

The Berkeley Aligner is very good. By jointly training the IBM-style word alignment models, it is able to get a much lower alignment error rate (AER) than older packages like GIZA++.

It also supports some more advanced features such as syntactic distortion (i.e., using parse tree information to get better alignments). For this, you only need parse trees for one language of the pair. So you should be okay doing Hindi<->English, since there are plenty of good, freely available English parsers.

If you decide not to go with the Berkeley Aligner, you should probably just use GIZA++. For years, it has been essentially the standard word aligner in the machine translation community.
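Since the question also asks about evaluation, here is a minimal sketch of how the AER mentioned above is usually computed against a hand-aligned gold standard: AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), where A is the set of predicted links, S the "sure" gold links, and P the "possible" gold links (with S treated as a subset of P). The function name and the representation of links as `(source_index, target_index)` tuples are assumptions made for illustration, not part of any of the tools named here.

```python
def alignment_error_rate(predicted, sure, possible=None):
    """Compute AER for one sentence pair (or a whole corpus of pooled links).

    Each link is a (source_index, target_index) tuple. This layout is an
    assumption for the sketch, not the file format of any specific aligner.
    """
    predicted = set(predicted)
    sure = set(sure)
    # By convention every sure link also counts as a possible link.
    possible = set(possible or ()) | sure

    if not predicted and not sure:
        return 0.0  # nothing to align and nothing predicted

    matched = len(predicted & sure) + len(predicted & possible)
    return 1.0 - matched / (len(predicted) + len(sure))


# Toy example: three predicted links, two sure gold links, one possible link.
predicted_links = {(0, 0), (1, 1), (2, 3)}
sure_links = {(0, 0), (1, 1)}
possible_links = {(2, 2)}
print(alignment_error_rate(predicted_links, sure_links, possible_links))
```

Lower is better; a prediction that exactly matches the sure links scores 0.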

大海や 2024-09-01 06:55:39

Uplug is a great tool; I have been using it to align English<->Macedonian texts.
It essentially builds on Giza++ by adding so-called clue alignments. Its advanced setting actually combines the clue alignments with Giza++ and performs 3 such iterations. The more clues (POS tags, lemmas, ...) you provide, the better the results will be. But I have to mention that you should not expect fundamentally different results than you would get by just using Giza++.

Anyway, if you plan to study SMT seriously, I suggest that you read the paper (PhD thesis) about Uplug; it will be very beneficial for you.

相守太难 2024-09-01 06:55:39

Moses is a statistical machine translation suite you might want to take a look at. Its word alignment component is built on GIZA++ but can be tweaked to work better with certain language pairs than pure GIZA++. Their mailing list and the resources you can find on http://www.statmt.org/ may also be a better place than Stack Overflow to ask questions on this topic. One thing you didn't mention, but which I would consider even more problematic, is where to get a Hindi<->English parallel corpus.
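One example of such tweaking is the choice of symmetrization heuristic: a Moses-style pipeline typically runs GIZA++ in both directions and then merges the two one-directional alignments. The sketch below shows only the two simplest merges, intersection (high precision) and union (high recall); the heuristics actually used in practice, such as grow-diag-final, start from the intersection and add selected links from the union. The function name and the tuple layout are assumptions for illustration and are not Moses code.

```python
def symmetrize(src_to_tgt, tgt_to_src, method="intersect"):
    """Merge two directional word alignments into one set of (src, tgt) links.

    src_to_tgt holds (src_index, tgt_index) pairs; tgt_to_src holds
    (tgt_index, src_index) pairs. Both layouts are assumed for this sketch.
    """
    forward = set(src_to_tgt)
    # Flip the reverse-direction links so both sets use (src, tgt) order.
    backward = {(src, tgt) for tgt, src in tgt_to_src}

    if method == "intersect":
        return forward & backward  # keep links both directions agree on
    if method == "union":
        return forward | backward  # keep links proposed by either direction
    raise ValueError(f"unknown method: {method!r}")


# Toy example with made-up indices for one sentence pair.
english_to_hindi = [(0, 0), (1, 2), (2, 1)]
hindi_to_english = [(0, 0), (2, 1), (3, 2)]
print(sorted(symmetrize(english_to_hindi, hindi_to_english, "intersect")))
print(sorted(symmetrize(english_to_hindi, hindi_to_english, "union")))
```

Which merge works best tends to depend on the language pair and corpus size, so it is worth comparing a few settings on a small hand-aligned sample.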
