How can I identify the language of a text document in Java?

Posted on 2024-07-12 02:27:14

Is there an existing Java library that could tell me whether a String contains English language text or not (e.g. I need to be able to distinguish French or Italian text -- the function needs to return false for French and Italian, and true for English)?

Comments (6)

面如桃花 2024-07-19 02:27:15

There are various techniques, and a robust method would combine various ones:

  • look at the frequencies of groups of n letters (say, groups of 3 letters or trigrams) in your text and see if they are similar to the frequencies found for the language you are testing against
  • look at whether the frequencies of common words in the given language match the frequencies found in your text (this tends to work better for longer texts)
  • does the text contain characters which strongly narrow it down to a particular language? (e.g. if the text contains an inverted question mark, there's a good chance it's Spanish)
  • can you "loosely parse" certain features in the text that would indicate a particular language, e.g. if it contains a match to the following regular expression, you could take this as a strong clue that the language is French:

    \bvous\s+\p{L}+ez\b
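
The last two bullets can be sketched in Java roughly as follows. This is only an illustration: the class and method names are made up for this example, and the case-insensitive and Unicode flags are an assumption about how you would want the regular expression to behave.

```java
import java.util.regex.Pattern;

// Hypothetical helper: names are illustrative, not from any library.
final class LanguageHints {
    // The regular expression from the answer: "vous" followed by a word ending in "ez".
    // Flags are an assumption: ignore case and treat \b and \p{L} in a Unicode-aware way.
    private static final Pattern FRENCH_VOUS_EZ = Pattern.compile(
            "\\bvous\\s+\\p{L}+ez\\b",
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.UNICODE_CHARACTER_CLASS);

    /** True if the text contains the "vous ...ez" construction, a clue for French. */
    static boolean hasFrenchVousClue(String text) {
        return FRENCH_VOUS_EZ.matcher(text).find();
    }

    /** True if the text contains an inverted question mark, a clue for Spanish. */
    static boolean hasInvertedQuestionMark(String text) {
        return text.indexOf('\u00BF') >= 0;
    }
}
```

Both methods only give clues, so they are best treated as extra evidence alongside the frequency-based checks rather than as a decision on their own.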

To get you started, here are frequent trigram and word counts for English, French and Italian (copied and pasted from some code; I'll leave parsing them as an exercise):

  Locale.ENGLISH,
      "he_=38426;the=38122;nd_=20901;ed_=20519;and=18417;ing=16248;to_=15295;ng_=15281;er_=15192;at_=14219",
      "the=11209;and=6631;to=5763;of=5561;a=5487;in=3421;was=3214;his=2313;that=2311;he=2115",
  Locale.FRENCH,
      "es_=38676;de_=28820;ent=21451;nt_=21072;e_d=18764;le_=17051;ion=15803;s_d=15491;e_l=14888;la_=14260",
      "de=10726;la=5581;le=3954;" + ((char)224) + "=3930;et=3563;des=3295;les=3277;du=2667;en=2505;un=1588",
  Locale.ITALIAN,
      "re_=7275;la_=7251;to_=7208;_di=7170;_e_=7031;_co=5919;che=5876;he_=5622;no_=5546;di_=5460",
      "di=7014;e=4045;il=3313;che=3006;la=2943;a=2541;in=2434;per=2165;del=2013;un=1945",

(Trigram counts are per million characters; word counts are per million words. The '_' character represents a word boundary.)
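
One possible way to parse these strings, assuming the `key=count;key=count` layout shown above. `FrequencyTable` and `parse` are names invented for this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for the "key=count;key=count;..." strings above.
final class FrequencyTable {
    /** Parses one frequency string into a map that keeps the original order. */
    static Map<String, Integer> parse(String spec) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String entry : spec.split(";")) {
            int eq = entry.lastIndexOf('=');
            if (eq <= 0) {
                continue; // skip empty or malformed entries
            }
            String key = entry.substring(0, eq);
            int count = Integer.parseInt(entry.substring(eq + 1).trim());
            counts.put(key, count);
        }
        return counts;
    }
}
```

A `LinkedHashMap` is used so that the position in the list (most frequent first) is preserved, which matters if you later want to weight by rank.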

As I recall, the figures are cited in the Oxford Handbook of Computational Linguistics and are based on a sample of newspaper articles. If you have a corpus of text in these languages, it's easy enough to derive similar figures yourself.

If you want a really quick-and-dirty way of applying the above, try:

  • consider each sequence of three characters in your text (replacing word boundaries with '_')
  • for each trigram that matches one of the frequent ones for the given language, increment that language's "score" by 1 (as a refinement, you could weight according to the position in the list)
  • at the end, assume the language is that with the highest score
  • optionally, do the same for the common words (combine scores)
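
A minimal sketch of the first three steps, using only the ten trigrams per language listed above and an unweighted score of 1 per match. The class name, the normalization rule (lowercase, every run of non-letters collapsed to one '_') and the "return null when nothing matches" behaviour are assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical quick-and-dirty guesser built on the trigram lists above.
final class TrigramGuesser {
    private static final Map<Locale, Set<String>> FREQUENT = new LinkedHashMap<>();
    static {
        FREQUENT.put(Locale.ENGLISH,
                Set.of("he_", "the", "nd_", "ed_", "and", "ing", "to_", "ng_", "er_", "at_"));
        FREQUENT.put(Locale.FRENCH,
                Set.of("es_", "de_", "ent", "nt_", "e_d", "le_", "ion", "s_d", "e_l", "la_"));
        FREQUENT.put(Locale.ITALIAN,
                Set.of("re_", "la_", "to_", "_di", "_e_", "_co", "che", "he_", "no_", "di_"));
    }

    /** Lowercases the text and replaces each run of non-letters with a single '_'. */
    static String normalize(String text) {
        return ("_" + text.toLowerCase(Locale.ROOT) + "_").replaceAll("[^\\p{L}]+", "_");
    }

    /** Adds 1 to a language's score for every trigram of the text found in its list. */
    static Map<Locale, Integer> score(String text) {
        String s = normalize(text);
        Map<Locale, Integer> scores = new LinkedHashMap<>();
        for (Locale language : FREQUENT.keySet()) {
            scores.put(language, 0);
        }
        for (int i = 0; i + 3 <= s.length(); i++) {
            String trigram = s.substring(i, i + 3);
            for (Map.Entry<Locale, Set<String>> e : FREQUENT.entrySet()) {
                if (e.getValue().contains(trigram)) {
                    scores.merge(e.getKey(), 1, Integer::sum);
                }
            }
        }
        return scores;
    }

    /** Returns the highest-scoring language, or null if no trigram matched at all. */
    static Locale guess(String text) {
        Locale best = null;
        int bestScore = 0;
        for (Map.Entry<Locale, Integer> e : score(text).entrySet()) {
            if (e.getValue() > bestScore) {
                best = e.getKey();
                bestScore = e.getValue();
            }
        }
        return best;
    }

    /** The "English or not" question from the original post. */
    static boolean looksEnglish(String text) {
        return Locale.ENGLISH.equals(guess(text));
    }
}
```

With only ten trigrams per language, very short inputs can easily be misjudged; calling `TrigramGuesser.looksEnglish(text)` is more reliable on a sentence or more of text.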

Obviously, this can then be refined, but you might find that this simple solution is good enough for what you want, since you're essentially interested in "English or not".

青衫负雪 2024-07-19 02:27:15

Have you tried Apache Tika? It has a good API for detecting language, and it can support different languages by loading their respective profiles.

失与倦" 2024-07-19 02:27:15

You could try comparing each word to an English, French, or Italian dictionary. Keep in mind, though, that some words may appear in more than one dictionary.

心在旅行 2024-07-19 02:27:15

Here's an interesting blog post that discusses this concept. The examples are in Scala, but you should be able to apply the same general concepts to Java.

以往的大感动 2024-07-19 02:27:15

If you are looking at individual characters or words, this is a tough problem. Since you're working with a whole document, however, there might be some hope. Unfortunately, I don't know of an existing library to do this.

In general, one would need a fairly comprehensive word list for each language. Then examine each word in the document. If it appears in the dictionary for a language, give that language a "vote". Some words will appear in more than one language, and sometimes a document in one language will use loanwords from another language, but a document wouldn't have to be very long before you saw a very clear trend toward one language.
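
The voting idea might look something like this in Java. `DictionaryVoter` is a made-up name, and the caller is assumed to supply the word lists (already lowercased) as one set per language; loading real dictionaries is left out.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical voter: dictionaries map a language name to its lowercase word list.
final class DictionaryVoter {
    /** Gives a language one vote for every word of the text found in its dictionary. */
    static Map<String, Integer> vote(String text, Map<String, Set<String>> dictionaries) {
        Map<String, Integer> votes = new LinkedHashMap<>();
        for (String language : dictionaries.keySet()) {
            votes.put(language, 0);
        }
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (word.isEmpty()) {
                continue; // split can yield an empty first token
            }
            for (Map.Entry<String, Set<String>> d : dictionaries.entrySet()) {
                if (d.getValue().contains(word)) {
                    votes.merge(d.getKey(), 1, Integer::sum);
                }
            }
        }
        return votes;
    }

    /** Returns the language with the most votes, or null if nothing matched. */
    static String winner(Map<String, Integer> votes) {
        String best = null;
        int bestVotes = 0;
        for (Map.Entry<String, Integer> e : votes.entrySet()) {
            if (e.getValue() > bestVotes) {
                best = e.getKey();
                bestVotes = e.getValue();
            }
        }
        return best;
    }
}
```

A word that appears in several dictionaries simply votes for each of them, so shared words and loanwords cancel out and the trend comes from the words unique to one language.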

Some of the best word lists for English are those used by Scrabble players. These lists probably exist for other languages too. The raw lists can be hard to find via Google, but they are out there.

梓梦 2024-07-19 02:27:15

There's no "good" way of doing this, imo. Every answer on this topic can get very complicated. The obvious part is to check for characters that occur in French and Italian but not in English, and return false if any are found.
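
As a rough illustration of that check, the sketch below flags any letter outside the ASCII range. Treating "non-ASCII letter" as "not an English character" is an assumption of this example, and the class name is made up; English text with loanwords such as "café" would be rejected by it.

```java
// Hypothetical check: any letter beyond ASCII is taken as a sign of non-English text.
final class AccentCheck {
    /** True if the text contains at least one letter outside the ASCII range. */
    static boolean hasNonAsciiLetter(String text) {
        return text.codePoints().anyMatch(cp -> cp > 0x7F && Character.isLetter(cp));
    }
}
```

Punctuation such as typographic quotes is ignored on purpose, since only letters count as evidence here.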

However, what if the word is French but has no special characters? Suppose you have a whole sentence. You could look up each word in a dictionary for each language, and if the sentence scores more French points than English points, it's not English. This keeps the words that French, Italian and English have in common from skewing the result.

Good Luck.
