I have a set of documents in two languages: English and German. There is no usable meta information about these documents, a program can look at the content only. Based on that, the program has to decide which of the two languages the document is written in.
Is there any "standard" algorithm for this problem that can be implemented in a few hours' time? Or alternatively, a free .NET library or toolkit that can do this? I know about LingPipe, but it is
- Java
- Not free for "semi-commercial" usage
This problem seems to be surprisingly hard. I checked out the Google AJAX Language API (which I found by searching this site first), but it was ridiculously bad. For six web pages in German to which I pointed it only one guess was correct. The other guesses were Swedish, English, Danish and French...
A simple approach I came up with is to use a list of stop words. My app already uses such a list for German documents in order to analyze them with Lucene.Net. If my app scans the documents for occurrences of stop words from either language the one with more occurrences would win. A very naive approach, to be sure, but it might be good enough. Unfortunately I don't have the time to become an expert at natural-language processing, although it is an intriguing topic.
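That naive approach can be sketched in a few lines of Python. The stop-word lists below are tiny hand-picked samples for illustration, not the full Lucene.Net lists, and ties are broken arbitrarily in favor of English:

```python
# Minimal stop-word-counting language guesser.
# The word lists are small illustrative samples only.
EN_STOPS = {"the", "and", "of", "to", "in", "is", "that", "it", "for", "with"}
DE_STOPS = {"der", "die", "das", "und", "ist", "nicht", "ein", "eine", "mit", "für"}

def guess_language(text: str) -> str:
    """Return 'en' or 'de' depending on which stop-word list matches more often."""
    words = text.lower().split()
    en_hits = sum(1 for w in words if w in EN_STOPS)
    de_hits = sum(1 for w in words if w in DE_STOPS)
    return "en" if en_hits >= de_hits else "de"
```

With realistic-length documents and the full stop-word lists, the counts are rarely close, which is why this crude method tends to work better than it looks.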
Try measuring the occurrences of each letter in the text. Calculate the letter frequencies, and perhaps their distributions, for English and for German texts. Having obtained these data, you can reason about which language's frequency distribution your text is closest to.
You could use Bayesian inference to determine the most likely language (with a certain error probability), or there may be other statistical methods suited to this kind of task.
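As a sketch of the frequency idea, here is a simple squared-difference comparison of letter-frequency profiles rather than full Bayesian inference; the two short sentences standing in for corpora are placeholders, and real profiles should be built from far more text:

```python
from collections import Counter

def letter_freqs(text: str) -> dict:
    """Relative frequency of each alphabetic character in the text."""
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def distance(p: dict, q: dict) -> float:
    """Sum of squared differences between two frequency profiles."""
    keys = set(p) | set(q)
    return sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys)

# Tiny stand-in "corpora" -- replace with profiles built from real documents.
PROFILES = {
    "en": letter_freqs("the quick brown fox jumps over the lazy dog "
                       "while the rain in spain stays mainly on the plain"),
    "de": letter_freqs("der schnelle braune fuchs springt über den faulen hund "
                       "während die sonne über den bergen scheint"),
}

def closest_language(text: str) -> str:
    """Pick the profile with the smallest distance to the text's profile."""
    f = letter_freqs(text)
    return min(PROFILES, key=lambda lang: distance(f, PROFILES[lang]))
```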
The problem with using a list of stop words is one of robustness. Stop word lists are basically a set of rules, one rule per word. Rule-based methods tend to be less robust to unseen data than statistical methods. Some problems you will encounter are documents that contain equal counts of stop words from each language, documents that have no stop words, documents that have stop words from the wrong language, etc. Rule-based methods can't do anything their rules don't specify.
One approach that doesn't require you to implement Naive Bayes or any other complicated math or machine learning algorithm yourself is to count character bigrams and trigrams (depending on whether you have a lot or only a little data to start with -- bigrams will work with less training data). Run the counts on a handful of documents (the more the better) of known source language and then construct an ordered list for each language by the number of counts. For example, English would have "th" as the most common bigram. With your ordered lists in hand, count the bigrams in a document you wish to classify and put them in order. Then go through each one and compare its position in the sorted unknown-document list to its rank in each of the training lists. Give each bigram a score for each language as
1 / ABS(RankInUnknown - RankInLanguage + 1)
Whichever language ends up with the highest score is the winner. It's simple, doesn't require a lot of coding, and doesn't require a lot of training data. Even better, you can keep adding data as you go and it will improve. Plus, you don't have to hand-create a list of stop words, and it won't fail just because a document contains no stop words.
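A rough sketch of this rank-comparison scoring follows. Note that the raw formula can divide by zero when the two ranks differ by exactly one, so the sketch guards that case; bigrams unseen in training are simply skipped here, though a fuller implementation would assign them a penalty instead:

```python
from collections import Counter

def bigram_ranks(text: str) -> dict:
    """Rank character bigrams by frequency (1 = most frequent)."""
    text = " ".join(text.lower().split())
    counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
    return {bg: rank for rank, (bg, _) in enumerate(counts.most_common(), start=1)}

def rank_score(unknown: str, reference: str) -> float:
    """Score an unknown text against one language's training text."""
    unk, ref = bigram_ranks(unknown), bigram_ranks(reference)
    total = 0.0
    for bg, rank in unk.items():
        if bg in ref:  # bigrams unseen in training contribute nothing here
            denom = abs(rank - ref[bg] + 1)
            total += 1.0 / denom if denom else 1.0  # guard the zero case
    return total

def classify_by_rank(unknown: str, corpora: dict) -> str:
    """corpora maps language name -> training text; highest score wins."""
    return max(corpora, key=lambda lang: rank_score(unknown, corpora[lang]))
```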
It will still be confused by documents that contain equal symmetrical bigram counts. If you can get enough training data, using trigrams will make this less likely. But using trigrams means you also need the unknown document to be longer. Really short documents may require you to drop down to single character (unigram) counts.
All this said, you're going to have errors. There's no silver bullet. Combining methods and choosing the language that maximizes your confidence in each method may be the smartest thing to do.
English and German use the same set of letters except for ä, ö, ü and ß (eszett). You can look for those letters for determining the language.
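A trivial check along these lines, with the caveat that the absence of these characters does not prove English -- short German texts, or texts transliterating ä as "ae", may contain none of them:

```python
# Characters that occur in German but not in English orthography.
GERMAN_CHARS = set("äöüßÄÖÜ")

def looks_german(text: str) -> bool:
    """True if the text contains any German-specific character."""
    return any(c in GERMAN_CHARS for c in text)
```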
You can also look at this text (Comparing two language identification schemes) from Grefenstette. It looks at letter trigrams and short words. Common trigrams for German: en_, er_, _de. Common trigrams for English: the_, he_, the...
There’s also Bob Carpenter’s How does LingPipe Perform Language ID?
Language detection is not very difficult conceptually. Please look at my reply to a related question and other replies to the same question.
In case you want to take a shot at writing it yourself, you should be able to write a naive detector in half a day. We use something similar to the following algorithm at work and it works surprisingly well. Also read the python implementation tutorial in the post I linked.
Steps:
Take two corpora for the two languages and extract character level bigrams, trigrams and whitespace-delimited tokens (words). Keep a track of their frequencies. This step builds your "Language Model" for both languages.
Given a piece of text, identify the char bigrams, trigrams and whitespace-delimited tokens and their corresponding "relative frequencies" for each corpus. If a particular "feature" (char bigram/trigram or token) is missing from your model, treat its "raw count" as 1 and use it to calculate its "relative frequency".
The product of the relative frequencies for a particular language gives the "score" for the language. This is a very naive approximation of the probability that the sentence belongs to that language.
The higher scoring language wins.
Note 1: We treat the "raw count" as 1 for features that do not occur in our language model. This is because, in reality, such a feature would have a very small probability, but since we have a finite corpus, we may not have encountered it yet. If you take its count to be zero, then your entire product would also be zero. To avoid this, we assume that it occurs once in our corpus. This is called add-one smoothing. There are other, more advanced smoothing techniques.
Note 2: Since you will be multiplying a large number of fractions, the product can easily underflow to zero. To avoid this, you can work in log space and sum the logarithms of the relative frequencies instead of multiplying the frequencies themselves.
Note 3: The algorithm I described is a "very-naive" version of the "Naive Bayes Algorithm".
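The steps above can be sketched roughly as follows, using trigram features only, add-one smoothing, and scores summed in log space; the corpus strings passed to the model are placeholders for real training data:

```python
import math
from collections import Counter

def char_trigrams(text: str) -> list:
    """Extract overlapping character trigrams from normalized text."""
    text = " ".join(text.lower().split())
    return [text[i:i + 3] for i in range(len(text) - 2)]

class TrigramModel:
    """A 'language model' of trigram counts built from one corpus."""

    def __init__(self, corpus: str):
        self.counts = Counter(char_trigrams(corpus))
        self.total = sum(self.counts.values())

    def log_prob(self, feature: str) -> float:
        # Add-one smoothing: an unseen feature gets a raw count of 1,
        # so the log never blows up on a zero probability.
        count = self.counts.get(feature, 0) + 1
        return math.log(count / (self.total + len(self.counts) + 1))

def classify_nb(text: str, models: dict) -> str:
    """models maps language name -> TrigramModel; highest log score wins."""
    feats = char_trigrams(text)
    # Summing log probabilities = log of the product of relative frequencies.
    return max(models, key=lambda lang: sum(models[lang].log_prob(f) for f in feats))
```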
I believe the standard procedure is to measure the quality of a proposed algorithm with test data (i.e. with a corpus). Define the percentage of correct analysis that you would like the algorithm to achieve, and then run it over a number of documents which you have manually classified.
As for the specific algorithm: using a list of stop words sounds fine. Another approach that has been reported to work is to use a Bayesian Filter, e.g. SpamBayes. Rather than training it into ham and spam, train it into English and German. Use a portion of your corpus, run that through spambayes, and then test it on the complete data.
If you're looking to flex your programming muscles by trying to solve the problem yourself, I encourage you to; however, the wheel exists if you would like to use it.
Windows 7 ships with this functionality built in. A component called "Extended Linguistic Services" (ELS) has the ability to detect scripts and natural languages, and it's in the box, on any Windows 7 or Windows Server 2008 machine. Depending on whether you have any such machines available and what you mean when you say "free," that will do it for you. In any case, this is an alternative to Google or the other vendors mentioned here.
http://msdn.microsoft.com/en-us/library/dd317700(v=VS.85).aspx
And if you want to access this from .NET, there's some information on that here:
http://windowsteamblog.com/blogs/developers/archive/2009/05/18/windows-7-managed-code-apis.aspx
Hope that helps.
The stop-words approach for the two languages is quick, and it would be made quicker by heavily weighting words that don't occur in the other language ("das" in German and "the" in English, for example). The use of such "exclusive words" would also help extend this approach robustly to a larger group of languages.
Isn't the problem several orders of magnitude easier if you've only got two languages (English and German) to choose from? In this case your approach of a list of stop words might be good enough.
Obviously you'd need to consider a rewrite if you added more languages to your list.
First things first, you should set up a test of your current solution and see if it reaches your desired level of accuracy. Success in your specific domain matters more than following a standard procedure.
If your method needs improving, try weighting your stop words by the rarity in a large corpus of English and German. Or you could use a more complicated technique like training a Markov model or Bayesian classifier. You could expand any of the algorithms to look at higher-order n-grams (for example, two or three word sequences) or other features in the text.
You can use the Google Language Detection API.
Here is a little program that uses it:
Other useful references:
Google Announces APIs (and demo):
http://googleblog.blogspot.com/2008/03/new-google-ajax-language-api-tools-for.html
Python wrapper:
http://code.activestate.com/recipes/576890-python-wrapper-for-google-ajax-language-api/
Another python script:
http://www.halotis.com/2009/09/15/google-translate-api-python-script/
RFC 1766 defines the language tags used by the W3C.
Get the current language codes from:
http://www.iana.org/assignments/language-subtag-registry
Have you tried Apache Tika? It can determine the language of a given text:
http://www.dovetailsoftware.com/blogs/kmiller/archive/2010/07/02/using-the-tika-java-library-in-your-net-application-with-ikvm
I have no experience with .Net but that link might help. If you can execute a jar in your environment, try this:
Output:
Hope that helps.