Open-Source Translation Packages
Here are some state-of-the-art open-source machine translation packages:
Training Data
To use any of the packages above, you'll need training data. If you're translating between European languages, you can use Philipp Koehn's Europarl parallel corpus. If you're interested in a European Union (EU) language that's not in Europarl, you can gather the data yourself by crawling the proceedings of the European Parliament. The proceedings are translated into each of the EU languages and made available for free online, which makes them a very good source of machine translation training data.
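Whichever source you pick, a parallel corpus usually ends up as two plain-text files in which line N of one file is the translation of line N of the other. As a rough sketch of reading such a pair in Python: the function name `read_parallel` and the file names in the usage comment are my own assumptions, not part of any of the packages, so check the layout of the files you actually download.

```python
from itertools import zip_longest
from typing import Iterator, Tuple


def read_parallel(src_path: str, tgt_path: str,
                  encoding: str = "utf-8") -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from two line-aligned files.

    Assumes line N of src_path is the translation of line N of tgt_path.
    """
    with open(src_path, encoding=encoding) as src, \
         open(tgt_path, encoding=encoding) as tgt:
        for line_no, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            # zip_longest pads the shorter file with None, so a None here
            # means the two files do not have the same number of lines.
            if s is None or t is None:
                raise ValueError(f"files differ in length at line {line_no}")
            s, t = s.strip(), t.strip()
            # Skip pairs where either side is empty; they carry no
            # usable translation.
            if s and t:
                yield s, t


# Hypothetical usage (file names are an assumption about your download):
# for en, fr in read_parallel("europarl.fr-en.en", "europarl.fr-en.fr"):
#     print(en, "|||", fr)
```

Skipping pairs with an empty side is a design choice on my part; some toolkits ship their own cleaning scripts that also drop very long or badly mismatched pairs, so prefer those if your package provides them.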
You can get training data for non-European languages from the Linguistic Data Consortium (LDC) catalog (e.g., Chinese-to-English: http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T09).