How to identify the language of a text document in Java?
Is there an existing Java library that could tell me whether a String contains English language text or not (e.g. I need to be able to distinguish French or Italian text -- the function needs to return false for French and Italian, and true for English)?
There are various techniques, and a robust method would combine several of them. For instance, you can "loosely parse" the text for features that indicate a particular language: if it contains a match for the following regular expression, you could take this as a strong clue that the language is French:
\bvous\s+\p{L}+ez\b
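As a minimal sketch of how that pattern could be applied in Java: the class and method names below (`FrenchClue`, `looksFrench`) are made up for illustration, and the case-insensitive and Unicode flags are my own assumption, added so that `\b` treats accented letters as word characters.

```java
import java.util.regex.Pattern;

final class FrenchClue {
    // "vous" followed by a word ending in "ez", e.g. "vous parlez".
    // The flags are an assumption: ignore case and use Unicode-aware word boundaries.
    private static final Pattern VOUS_EZ = Pattern.compile(
            "\\bvous\\s+\\p{L}+ez\\b",
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.UNICODE_CHARACTER_CLASS);

    /** Returns true if the text contains the "vous ...ez" clue. */
    static boolean looksFrench(String text) {
        return VOUS_EZ.matcher(text).find();
    }
}
```

A match is only a clue that the text is French; the absence of a match says nothing either way, so a check like this is best combined with the statistical techniques.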
To get you started, here are frequent trigram and word counts for English, French and Italian (copied and pasted from some code; I'll leave it as an exercise to parse them):
(Trigram counts are per million characters; word counts are per million words. The '_' character represents a word boundary.)
As I recall, the figures are cited in The Oxford Handbook of Computational Linguistics and are based on a sample of newspaper articles. If you have a corpus of text in these languages, it's easy enough to derive similar figures yourself.
If you want a really quick-and-dirty way of applying the above, try:
Obviously, this can then be refined, but you might find that this simple solution is good enough for what you want, since you're essentially interested in "English or not".
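For readers who want something concrete to start from, here is a rough sketch of the trigram idea. It is not the answer's own snippet: it derives trigram frequencies from whatever sample text you give it, then picks the language whose profile overlaps most with the input. The class `TrigramGuesser`, its methods, and the one-sentence "corpora" are all hypothetical placeholders; real use would need profiles built from a proper corpus.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class TrigramGuesser {
    // language code -> (trigram -> relative frequency in that language's sample)
    private final Map<String, Map<String, Double>> profiles = new LinkedHashMap<>();

    /** Lower-cases the text and turns every run of non-letters into one '_' word boundary. */
    static String normalize(String text) {
        return ("_" + text.toLowerCase(Locale.ROOT) + "_").replaceAll("[^\\p{L}]+", "_");
    }

    /** Relative frequency of each trigram: its count divided by the total number of trigrams. */
    static Map<String, Double> trigramFrequencies(String text) {
        String s = normalize(text);
        int total = s.length() - 2;
        Map<String, Double> freq = new HashMap<>();
        for (int i = 0; i < total; i++) {
            freq.merge(s.substring(i, i + 3), 1.0 / total, Double::sum);
        }
        return freq;
    }

    void train(String language, String corpus) {
        profiles.put(language, trigramFrequencies(corpus));
    }

    /** Returns the trained language with the highest overlap score, or "unknown" if nothing matches. */
    String guess(String text) {
        Map<String, Double> sample = trigramFrequencies(text);
        String best = "unknown";
        double bestScore = 0.0;
        for (Map.Entry<String, Map<String, Double>> profile : profiles.entrySet()) {
            double score = 0.0; // dot product of the two frequency tables
            for (Map.Entry<String, Double> t : sample.entrySet()) {
                score += t.getValue() * profile.getValue().getOrDefault(t.getKey(), 0.0);
            }
            if (score > bestScore) {
                bestScore = score;
                best = profile.getKey();
            }
        }
        return best;
    }

    boolean isEnglish(String text) {
        return "en".equals(guess(text));
    }

    /** Toy profiles from one made-up sentence per language; far too small for real use. */
    static TrigramGuesser withToyProfiles() {
        TrigramGuesser g = new TrigramGuesser();
        g.train("en", "the quick brown fox jumps over the lazy dog and the cat "
                + "that sleeps on the mat with the other animals");
        g.train("fr", "le renard brun rapide saute par dessus le chien paresseux "
                + "et le chat qui dort sur le tapis avec les autres animaux");
        g.train("it", "la volpe marrone veloce salta sopra il cane pigro "
                + "e il gatto che dorme sul tappeto con gli altri animali");
        return g;
    }
}
```

With profiles trained on a realistic corpus, `isEnglish(text)` gives the "English or not" answer asked for. The dot-product score is just one simple choice; a rank-based or log-likelihood comparison is a common refinement.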
Have you tried Apache Tika? It has a good API for detecting language, and it can support additional languages by loading the respective profiles.
You could try comparing each word to an English, French, or Italian dictionary. Keep in mind, though, that some words may appear in more than one dictionary.
Here's an interesting blog post that discusses this concept. The examples are in Scala, but you should be able to apply the same general concepts to Java.
If you are looking at individual characters or words, this is a tough problem. Since you're working with a whole document, however, there might be some hope. Unfortunately, I don't know of an existing library to do this.
In general, one would need a fairly comprehensive word list for each language. Then examine each word in the document. If it appears in the dictionary for a language, give that language a "vote". Some words will appear in more than one language, and sometimes a document in one language will use loanwords from another language, but a document wouldn't have to be very long before you saw a very clear trend toward one language.
Some of the best word lists for English are those used by Scrabble players. These lists probably exist for other languages too. The raw lists can be hard to find via Google, but they are out there.
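The voting scheme described above could be sketched roughly as follows. The class `WordVote` and its methods are hypothetical names, and the word lists are supplied by the caller (for example, loaded from whatever list files you manage to find); nothing here assumes a particular source for them.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class WordVote {
    // language code -> set of lower-case words known for that language
    private final Map<String, Set<String>> wordLists = new LinkedHashMap<>();

    void addLanguage(String language, Set<String> words) {
        wordLists.put(language, words);
    }

    /** Gives each language one vote for every word of the text found in its list. */
    Map<String, Integer> votes(String text) {
        Map<String, Integer> votes = new LinkedHashMap<>();
        for (String language : wordLists.keySet()) {
            votes.put(language, 0);
        }
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (word.isEmpty()) {
                continue; // split() yields an empty first token when the text starts with punctuation
            }
            for (Map.Entry<String, Set<String>> list : wordLists.entrySet()) {
                if (list.getValue().contains(word)) {
                    votes.merge(list.getKey(), 1, Integer::sum);
                }
            }
        }
        return votes;
    }

    /** True only if "en" has at least one vote and strictly more votes than every other language. */
    boolean isEnglish(String text) {
        Map<String, Integer> v = votes(text);
        int english = v.getOrDefault("en", 0);
        if (english == 0) {
            return false;
        }
        for (Map.Entry<String, Integer> e : v.entrySet()) {
            if (!"en".equals(e.getKey()) && e.getValue() >= english) {
                return false;
            }
        }
        return true;
    }
}
```

A word that appears in several lists simply votes for each of them, so shared words and loanwords cancel out and the trend comes from the words unique to one language.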
There's no "good" way of doing this, IMO. Any answer on this topic can get very complicated. The obvious first step is to check for characters that occur in French and Italian but not in English, and return false if you find any.
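A minimal sketch of that character check, assuming that "not in English" can be approximated as "any letter outside plain ASCII" (my simplification, and the names `AccentCheck` and `hasNonAsciiLetter` are made up):

```java
import java.text.Normalizer;

final class AccentCheck {
    /** Returns true if the text contains a letter outside the ASCII range, such as an accented vowel. */
    static boolean hasNonAsciiLetter(String text) {
        // Compose "e" + combining accent into a single accented letter before testing.
        String composed = Normalizer.normalize(text, Normalizer.Form.NFC);
        return composed.codePoints().anyMatch(cp -> cp > 0x7F && Character.isLetter(cp));
    }
}
```

Note that this also flags English text containing accented loanwords or names, so on its own it is only a hint.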
However, what if the word is French but has no special characters? Suppose you have a whole sentence. You could look up each word in a dictionary for each language, and if the sentence scores more French points than English points, it's not English. This also copes with the words that French, Italian and English have in common.
Good luck.