How can I identify the language of a text document in Java?

Posted on 2024-07-12 02:27:14

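Is there an existing Java library that can tell me whether a string contains English text? For example, I need to be able to tell English apart from French or Italian text: the function needs to return false for French and Italian, and true for English.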

面如桃花 2024-07-19 02:27:15

There are various techniques, and a robust method would combine several of them:

look at the frequencies of groups of n letters (say, groups of 3 letters or trigrams) in your text and see if they are similar to the frequencies found for the language you are testing against
look at whether the frequent words of the given language occur in your text at similar frequencies (this tends to work better for longer texts)
does the text contain characters which strongly narrow it down to a particular language? (e.g. if the text contains an upside down question mark there's a good chance it's Spanish)
can you "loosely parse" certain features in the text that would indicate a particular language, e.g. if it contains a match to the following regular expression, you could take this as a strong clue that the language is French:
\bvous\s+\p{L}+ez\b
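As a rough illustration of the last two hints, here is a minimal Java sketch. The class and method names (LanguageHints, looksFrench, hasInvertedQuestionMark) are made up for this example, and the case-insensitive and Unicode flags are my own assumption rather than part of the pattern as quoted. A hit is only a clue, not proof of the language.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class LanguageHints {
    // The pattern quoted above: "vous", whitespace, then a word ending in "ez".
    // The flags are an assumption: they make the match case-insensitive and
    // make \b and \s Unicode-aware.
    private static final Pattern FRENCH_VOUS_EZ = Pattern.compile(
            "\\bvous\\s+\\p{L}+ez\\b",
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.UNICODE_CHARACTER_CLASS);

    /** True if the text contains the "vous ...ez" construction. */
    static boolean looksFrench(String text) {
        return FRENCH_VOUS_EZ.matcher(text).find();
    }

    /** True if the text contains an inverted question mark (U+00BF). */
    static boolean hasInvertedQuestionMark(String text) {
        return text.indexOf('\u00BF') >= 0;
    }

    public static void main(String[] args) {
        System.out.println(looksFrench("Est-ce que vous parlez anglais ?"));
        System.out.println(looksFrench("Do you speak English?"));
        System.out.println(hasInvertedQuestionMark("\u00BFHabla usted ingl\u00E9s?"));
    }
}
```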

To get you started, here are frequent trigram and word counts for English, French and Italian (copied and pasted from some code; I'll leave parsing them as an exercise):

  Locale.ENGLISH,
      "he_=38426;the=38122;nd_=20901;ed_=20519;and=18417;ing=16248;to_=15295;ng_=15281;er_=15192;at_=14219",
      "the=11209;and=6631;to=5763;of=5561;a=5487;in=3421;was=3214;his=2313;that=2311;he=2115",
  Locale.FRENCH,
      "es_=38676;de_=28820;ent=21451;nt_=21072;e_d=18764;le_=17051;ion=15803;s_d=15491;e_l=14888;la_=14260",
      "de=10726;la=5581;le=3954;" + ((char)224) + "=3930;et=3563;des=3295;les=3277;du=2667;en=2505;un=1588",
  Locale.ITALIAN,
      "re_=7275;la_=7251;to_=7208;_di=7170;_e_=7031;_co=5919;che=5876;he_=5622;no_=5546;di_=5460",
      "di=7014;e=4045;il=3313;che=3006;la=2943;a=2541;in=2434;per=2165;del=2013;un=1945",

(Trigram counts are per million characters; word counts are per million words. The '_' character represents a word boundary.)
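For what it's worth, parsing one of those strings could look something like the sketch below. FrequencyTable and parse are names invented for this example; the only thing taken from the data above is the "key=count;key=count" layout.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class, for illustration only.
final class FrequencyTable {
    /**
     * Parses a string of the form "key=count;key=count;..." into a map that
     * keeps the original order (most frequent entry first in the data above).
     */
    static Map<String, Integer> parse(String spec) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String entry : spec.split(";")) {
            if (entry.isEmpty()) {
                continue; // tolerate a trailing ';'
            }
            int eq = entry.lastIndexOf('=');
            if (eq < 0) {
                throw new IllegalArgumentException("Missing '=' in entry: " + entry);
            }
            counts.put(entry.substring(0, eq), Integer.parseInt(entry.substring(eq + 1)));
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> english = parse("he_=38426;the=38122;nd_=20901;ed_=20519;and=18417");
        System.out.println(english);
    }
}
```

A LinkedHashMap is used so that the position of an entry in the list is still available if you later want to weight by rank.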

As I recall, the figures are cited in the Oxford Handbook of Computational Linguistics and are based on a sample of newspaper articles. If you have a corpus of text in these languages, it's easy enough to derive similar figures yourself.
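If you do want to derive figures from your own corpus, a sketch along these lines might be a starting point. The normalisation rules here are my assumptions, not necessarily how the figures above were produced: text is lower-cased, every run of non-letters becomes a single '_', and counts are scaled by the length of the normalised text. TrigramCounter and its methods are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class, for illustration only.
final class TrigramCounter {
    /** Lower-cases letters and collapses every run of non-letters into one '_'. */
    static String normalize(String text) {
        StringBuilder sb = new StringBuilder("_");
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                sb.append(Character.toLowerCase(c));
            } else if (sb.charAt(sb.length() - 1) != '_') {
                sb.append('_');
            }
        }
        if (sb.charAt(sb.length() - 1) != '_') {
            sb.append('_');
        }
        return sb.toString();
    }

    /** Counts each trigram and scales the count to "per million characters". */
    static Map<String, Long> perMillion(String corpus) {
        String s = normalize(corpus);
        Map<String, Long> raw = new HashMap<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            raw.merge(s.substring(i, i + 3), 1L, Long::sum);
        }
        Map<String, Long> scaled = new HashMap<>();
        for (Map.Entry<String, Long> e : raw.entrySet()) {
            // Integer division; assumption: the denominator is the normalised length.
            scaled.put(e.getKey(), e.getValue() * 1_000_000L / s.length());
        }
        return scaled;
    }

    public static void main(String[] args) {
        System.out.println(normalize("The  cat!"));
        System.out.println(perMillion("The cat sat on the mat."));
    }
}
```

On a real corpus you would then sort the result by count and keep the top entries per language.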

If you want a really quick-and-dirty way of applying the above, try:

consider each sequence of three characters in your text (replacing word boundaries with '_')
for each trigram that matches one of the frequent ones for the given language, increment that language's "score" by 1 (for a more sophisticated version, you could weight each match by its position in the list)
at the end, assume the language is that with the highest score
optionally, do the same for the common words (combine scores)
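The steps above could be sketched roughly as follows. This is only an unweighted version using the ten trigrams per language listed earlier; the class name QuickLanguageGuess, the normalisation (lower-case, runs of non-letters collapsed to '_') and the tie-breaking (earlier language wins) are assumptions of mine, and word counts are left out.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical helper class, for illustration only.
final class QuickLanguageGuess {
    // Frequent trigrams per language, taken from the lists above.
    private static final Map<Locale, Set<String>> FREQUENT = new LinkedHashMap<>();
    static {
        FREQUENT.put(Locale.ENGLISH,
                Set.of("he_", "the", "nd_", "ed_", "and", "ing", "to_", "ng_", "er_", "at_"));
        FREQUENT.put(Locale.FRENCH,
                Set.of("es_", "de_", "ent", "nt_", "e_d", "le_", "ion", "s_d", "e_l", "la_"));
        FREQUENT.put(Locale.ITALIAN,
                Set.of("re_", "la_", "to_", "_di", "_e_", "_co", "che", "he_", "no_", "di_"));
    }

    /** Lower-cases letters and collapses every run of non-letters into one '_'. */
    static String normalize(String text) {
        StringBuilder sb = new StringBuilder("_");
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                sb.append(Character.toLowerCase(c));
            } else if (sb.charAt(sb.length() - 1) != '_') {
                sb.append('_');
            }
        }
        if (sb.charAt(sb.length() - 1) != '_') {
            sb.append('_');
        }
        return sb.toString();
    }

    /** Adds 1 to a language's score for every trigram found in its frequent list. */
    static Map<Locale, Integer> scores(String text) {
        String s = normalize(text);
        Map<Locale, Integer> result = new LinkedHashMap<>();
        for (Locale language : FREQUENT.keySet()) {
            result.put(language, 0);
        }
        for (int i = 0; i + 3 <= s.length(); i++) {
            String trigram = s.substring(i, i + 3);
            for (Map.Entry<Locale, Set<String>> e : FREQUENT.entrySet()) {
                if (e.getValue().contains(trigram)) {
                    result.merge(e.getKey(), 1, Integer::sum);
                }
            }
        }
        return result;
    }

    /** Highest-scoring language; empty if no trigram matched at all. */
    static Optional<Locale> guess(String text) {
        Locale best = null;
        int bestScore = 0;
        for (Map.Entry<Locale, Integer> e : scores(text).entrySet()) {
            if (e.getValue() > bestScore) { // strict '>' so ties keep the earlier language
                best = e.getKey();
                bestScore = e.getValue();
            }
        }
        return Optional.ofNullable(best);
    }

    static boolean isEnglish(String text) {
        return guess(text).map(Locale.ENGLISH::equals).orElse(false);
    }

    public static void main(String[] args) {
        System.out.println(isEnglish("The cat sat on the mat and then walked to the garden."));
        System.out.println(isEnglish("Les enfants de la ville parlent de la situation des maisons."));
    }
}
```

With only ten trigrams per language, short inputs will often produce ties or no matches at all, so treat the result as a hint unless the text is reasonably long.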

Obviously, this can then be refined, but you might find that this simple solution is good enough for what you want, since you're essentially interested in "English or not".
