How do I determine which language a plain-text file is written in?
Suppose we have a text file with the content:
"Je suis un beau homme ..."
another with:
"I am a brave man"
the third with a text in German:
"Guten morgen. Wie geht's ?"
How do we write a function that tells us the probability that the text in the first
file is in French, that the second is in English, and so on?
Links to books / out-of-the-box solutions are welcome. I write in Java, but I can learn Python if needed.
My comments
- There's one small comment I need to add. The text may contain phrases in different languages, either as part of the whole or as the result of a mistake. Classic literature has many examples of this, because members of the aristocracy were multilingual. So a probability describes the situation better: most of the text is in one language, while other parts may be written in another.
- Google API - Internet connection. I would prefer not to use remote functions/services, as I need to do it myself or use a downloadable library. I'd like to do some research on this topic.
Comments (10)
There is a package called JLangDetect which seems to do exactly what you want:
Edit: as Kevin pointed out, there is similar functionality in the Nutch project provided by the package org.apache.nutch.analysis.lang.
Language detection by Google: http://code.google.com/apis/ajaxlanguage/documentation/#Detect
For larger corpora of text you usually use the distribution of letters, digraphs and even trigraphs, and compare it with the known distributions for the languages you want to detect.
However, a single sentence is very likely too short to yield any useful statistical measures. You may have more luck with matching individual words with a dictionary, then.
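The dictionary idea for short inputs can be sketched roughly as below. The word lists are tiny, hand-picked assumptions for illustration only; a usable version would load full word lists per language, and the names `WORDS`, `dictionary_scores` and `guess_language` are hypothetical.

```python
import re

# Tiny illustrative word lists (assumption: real use needs full dictionaries).
WORDS = {
    "en": {"i", "am", "a", "the", "is", "man", "brave", "and", "of"},
    "fr": {"je", "suis", "un", "le", "la", "est", "homme", "beau", "et"},
    "de": {"guten", "morgen", "wie", "der", "die", "das", "ist", "und", "ich"},
}

def dictionary_scores(text):
    """Return, per language, the fraction of tokens found in its word list."""
    # Tokens are runs of letters; digits, underscores and punctuation are dropped.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return {lang: 0.0 for lang in WORDS}
    return {
        lang: sum(tok in vocab for tok in tokens) / len(tokens)
        for lang, vocab in WORDS.items()
    }

def guess_language(text):
    """Pick the language whose word list covers the most tokens."""
    scores = dictionary_scores(text)
    return max(scores, key=scores.get)
```

The per-language fractions from `dictionary_scores` are not true probabilities, but they give a rough picture of how much of a text each language accounts for.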
NGramJ seems to be a bit more up-to-date:
http://ngramj.sourceforge.net/
It also has both character-oriented and byte-oriented profiles, so it should be able to identify the character set too.
For documents in multiple languages you need to identify the character set (ICU4J has a CharsetDetector that can do this), then split the text on something reasonable like multiple line breaks, or paragraphs if the text is marked up.
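The splitting step might look like the sketch below. It assumes the text has already been decoded to a string (the character set detection itself is not shown), and `split_paragraphs` is a hypothetical helper name.

```python
import re

def split_paragraphs(text):
    """Split on runs of two or more line breaks; drop empty chunks."""
    # A single line break stays inside its paragraph; blank lines separate chunks.
    chunks = re.split(r"(?:\r?\n[ \t]*){2,}", text)
    return [c.strip() for c in chunks if c.strip()]
```

Each returned chunk can then be passed to whichever language identifier you use, giving one language guess per paragraph.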
Try Nutch's Language Identifier. It is trained with n-gram profiles of languages, and the profile of the input text is matched against the profiles of the available languages. Interestingly, you can add more languages if you need to.
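To give a feel for what matching n-gram profiles means, here is a generic sketch of rank-order ("out-of-place") profile comparison. It is not Nutch's code or API; the function names and the toy training strings in `SAMPLES` are assumptions, and real profiles would be built from much larger samples.

```python
from collections import Counter

def ngram_profile(text, n_max=3, top=300):
    """Ranked list of the most frequent character n-grams (lengths 1..n_max)."""
    # Normalize whitespace and mark word boundaries with underscores.
    padded = "_" + "_".join(text.lower().split()) + "_"
    counts = Counter(
        padded[i:i + n]
        for n in range(1, n_max + 1)
        for i in range(len(padded) - n + 1)
    )
    return [gram for gram, _ in counts.most_common(top)]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a fixed penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(abs(i - rank[g]) if g in rank else penalty
               for i, g in enumerate(doc_profile))

def identify(text, profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))

# Toy training text (assumption, for illustration only).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the man is brave",
    "fr": "le renard brun saute par dessus le chien paresseux et je suis un homme",
    "de": "der schnelle braune fuchs springt über den faulen hund und ich bin ein mann",
}
profiles = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}
```

Adding a language in this scheme just means adding one more entry to `profiles`, which mirrors the extensibility mentioned above.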
Look up Markov chains.
Basically, you will need statistically significant samples of the languages you want to recognize. When you get a new file, see what the frequencies of specific syllables or phonemes are, and compare them with the pre-calculated samples. Pick the closest one.
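A minimal sketch of this idea, using single characters as a stand-in for syllables or phonemes (an assumption made to keep the example short): each language gets a first-order Markov chain of character transitions with add-one smoothing, and a new text is assigned to the model that gives it the highest log-likelihood. The class and function names are hypothetical.

```python
import math
from collections import Counter

class CharMarkovModel:
    """First-order character Markov chain with add-one smoothing."""

    def __init__(self, sample):
        text = " ".join(sample.lower().split())
        self.pairs = Counter(zip(text, text[1:]))   # counts of transitions a -> b
        self.firsts = Counter(text[:-1])            # how often a starts a transition
        self.alphabet = set(text)

    def log_likelihood(self, text, vocab_size):
        """Log-probability of the text's transitions under this model."""
        text = " ".join(text.lower().split())
        total = 0.0
        for a, b in zip(text, text[1:]):
            num = self.pairs[(a, b)] + 1
            den = self.firsts[a] + vocab_size
            total += math.log(num / den)
        return total

def classify(text, models):
    """Return the language whose model assigns the text the highest likelihood."""
    # Shared vocabulary size: all seen characters plus one slot for unseen ones.
    vocab_size = len(set().union(*(m.alphabet for m in models.values()))) + 1
    return max(models, key=lambda lang: models[lang].log_likelihood(text, vocab_size))
```

Usage would be along the lines of `models = {"en": CharMarkovModel(english_sample), "fr": CharMarkovModel(french_sample)}` followed by `classify(new_text, models)`, where the samples are large texts you supply yourself.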
Although a more complicated solution than you are looking for, you could use Vowpal Wabbit and train it with sentences from different languages.
In theory you could get back a language for every sentence in your documents.
http://hunch.net/~vw/
(Don't be fooled by the "online" in the project's subtitle: that's just mathspeak for learning without having to hold the whole training set in memory.)
If you are interested in the mechanism by which language detection can be performed, I refer you to the following article (Python-based), which uses a (very) naive method but is a good introduction to this problem in particular and to machine learning (just a big word) in general.
For Java implementations, JLangDetect and Nutch, as suggested by the other posters, are pretty good. Also take a look at Lingpipe, JTCL and NGramJ.
For the problem where you have multiple languages in the same page, you can use a sentence boundary detector to chop a page into sentences and then attempt to identify the language of each sentence. Assuming that a sentence contains only one (primary) language, you should still get good results with any of the above implementations.
Note: A sentence boundary detector (SBD) is theoretically language specific (chicken-egg problem since you need one for the other). But for latin-script based languages (English, French, German, etc.) that primarily use periods (apart from exclamations etc.) for sentence delimiting, you will get acceptable results even if you use an SBD designed for English. I wrote a rules-based English SBD that has worked really well for French text. For implementations, take a look at OpenNLP.
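As a rough illustration of the sentence-by-sentence approach, here is a deliberately naive rule-based splitter. It is not the SBD described above and not OpenNLP; it will mis-split abbreviations such as "Dr. Smith". The `classify` argument is a placeholder for whichever language identifier you plug in.

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def label_sentences(text, classify):
    """Pair each sentence with the language returned by the given classifier."""
    return [(sentence, classify(sentence)) for sentence in split_sentences(text)]
```

Counting the labels returned by `label_sentences` gives a simple per-language share for a mixed-language document.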
An alternative option to using the SBD is to use a sliding window of say 10 tokens (whitespace delimited) to create a pseudo-sentence (PS) and try and identify the border where the language changes. This has the disadvantage that if your entire document has n tokens, you will perform approximately n-10 classification operations on strings of length 10 tokens each. In the other approach, if the average sentence has 10 tokens, you would have performed approximately n/10 classification operations. If n = 1000 words in a document, you are comparing 990 operations versus 100 operations: an order of magnitude difference.
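The cost comparison can be checked with a small sketch. The helper names are hypothetical, and fixed non-overlapping chunks stand in for sentences of about 10 tokens. The exact window count is n - 10 + 1, which the estimate above rounds to n - 10.

```python
def sliding_windows(tokens, size=10):
    """Yield every run of `size` consecutive tokens (one pseudo-sentence each)."""
    for start in range(max(len(tokens) - size + 1, 0)):
        yield tokens[start:start + size]

def fixed_chunks(tokens, size=10):
    """Yield non-overlapping chunks, standing in for sentences of about `size` tokens."""
    for start in range(0, len(tokens), size):
        yield tokens[start:start + size]

# A dummy 1000-token document.
tokens = ["w%d" % i for i in range(1000)]
n_windows = sum(1 for _ in sliding_windows(tokens))  # 991 classifications
n_chunks = sum(1 for _ in fixed_chunks(tokens))      # 100 classifications
```

Note that `sliding_windows` yields nothing for documents shorter than one window, so very short inputs would need to be classified as a whole.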
If you have short phrases (under 20 characters), the accuracy of language detection is poor in my experience, particularly for proper nouns and for nouns that are the same across languages, like "chocolate". For example, is "New York" an English word or a French word if it appears in a French sentence?
Do you have a connection to the internet? If you do, then the Google Language API would be perfect for you.
If you don't, there are other methods.
Bigram models perform well, are simple to write, simple to train, and require only a small amount of text for detection. The Nutch language identifier is a Java implementation we found and used with a thin wrapper.
We had problems with a bigram model for mixed CJK and English text (e.g. a tweet that is mostly Japanese but contains a single English word). This is obvious in retrospect from looking at the math (Japanese has many more characters, so the probability of any given pair is low). I think you could solve this with some more complicated log-linear comparison, but I cheated and used a simple filter based on character sets that are unique to certain languages (i.e. if it only contains unified Han, then it's Chinese; if it contains some Japanese kana and unified Han, then it's Japanese).
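A character-set filter of that kind might be sketched as follows. The function name is hypothetical, and the Unicode ranges are a simplification covering only the basic Hiragana, Katakana and CJK Unified Ideographs blocks. As an assumption beyond the rule stated above, any text containing kana is treated as Japanese, even without Han characters.

```python
def script_hint(text):
    """Return 'ja', 'zh', or None based only on which scripts appear in the text."""
    has_kana = any("\u3040" <= ch <= "\u30ff" for ch in text)  # Hiragana + Katakana
    has_han = any("\u4e00" <= ch <= "\u9fff" for ch in text)   # CJK Unified Ideographs
    if has_kana:
        return "ja"   # kana is specific to Japanese
    if has_han:
        return "zh"   # Han characters with no kana: assume Chinese
    return None       # no CJK evidence; fall back to the statistical model
```

When the filter returns `None`, the text would go to the bigram model as before, so the filter only overrides the cases where the statistics are known to struggle.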