How does language detection work?

Posted 2024-12-08 19:25:30

I have been wondering for some time how Google Translate (or maybe a hypothetical translator) detects the language of the string entered in the "from" field. I have been thinking about this, and the only thing I can think of is looking for words in the input string that are unique to a language. Another way could be to check sentence formation or other semantics in addition to keywords. But this seems to be a very difficult task considering the number of different languages and their semantics. I did some research and found that there are approaches that use n-gram sequences and statistical models to detect the language. I would appreciate a high-level answer too.

Comments (5)

み青杉依旧 2024-12-15 19:25:30

Take Wikipedia in English. Check the probability that the letter 'a' is followed by a 'b' (for example), and do that for all combinations of letters; you will end up with a matrix of probabilities.

If you do the same for Wikipedia in different languages, you will get a different matrix for each language.

To detect the language, just use all those matrices and use the probabilities as a score. Let's say that in English you'd get these probabilities:

t->h = 0.3, h->e = 0.2

and in the Spanish matrix you'd get:

t->h = 0.01, h->e = 0.3

The word 'the', using the English matrix, would give you a score of 0.3 + 0.2 = 0.5,
and using the Spanish one: 0.01 + 0.3 = 0.31.

The English matrix wins, so the text has to be English.
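As a rough illustration (not part of the original answer), here is a minimal Python sketch of that scoring idea; the probability values are invented to mirror the example above, and a real detector would estimate them from large corpora such as Wikipedia dumps and would typically work with smoothed log-probabilities.

# Toy letter-transition "matrices"; the numbers are invented to mirror
# the example above, not estimated from real corpora.
BIGRAM_PROBS = {
    "English": {("t", "h"): 0.30, ("h", "e"): 0.20},
    "Spanish": {("t", "h"): 0.01, ("h", "e"): 0.30},
}

def score(text, probs):
    # Sum the transition probabilities over all adjacent letter pairs.
    text = text.lower()
    return sum(probs.get(pair, 0.0) for pair in zip(text, text[1:]))

def detect(text):
    # Pick the language whose matrix gives the highest total score.
    return max(BIGRAM_PROBS, key=lambda lang: score(text, BIGRAM_PROBS[lang]))

print(score("the", BIGRAM_PROBS["English"]))  # 0.5
print(score("the", BIGRAM_PROBS["Spanish"]))  # ~0.31
print(detect("the"))                          # English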

情定在深秋 2024-12-15 19:25:30

If you want to implement a lightweight language guesser in the programming language of your choice, you can use the method from Cavnar and Trenkle '94, 'N-Gram-Based Text Categorization'. You can find the paper on Google Scholar, and it is pretty straightforward.

Their method builds an N-gram statistic for every language it should later be able to guess, from some text in that language. Such a statistic is then built for the unknown text as well and compared to the previously trained statistics using a simple out-of-place measure.
If you use unigrams + bigrams (possibly + trigrams) and compare the 100-200 most frequent N-grams, your hit rate should be over 95% if the text to guess is not too short.
There was a demo available here, but it doesn't seem to work at the moment.

There are other approaches to language guessing, including computing N-gram probabilities and more advanced classifiers, but in most cases the Cavnar and Trenkle approach should perform well enough.
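A minimal Python sketch of that scheme might look like the following; the function names, the top-200 cutoff, and the fixed penalty for missing N-grams are my own choices, and real language profiles would be built from much larger per-language samples than the toy strings in the usage lines.

from collections import Counter

def ngram_profile(text, max_n=3, top_k=200):
    # Rank the most frequent character n-grams (n = 1..max_n),
    # most frequent first, as in Cavnar & Trenkle's profiles.
    text = " ".join(text.lower().split())
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(text[i:i + n] for i in range(len(text) - n + 1))
    return [gram for gram, _ in counts.most_common(top_k)]

def out_of_place(doc_profile, lang_profile):
    # Sum of rank differences; n-grams absent from the language
    # profile get a fixed maximum penalty.
    lang_rank = {gram: rank for rank, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(abs(rank - lang_rank[gram]) if gram in lang_rank else max_penalty
               for rank, gram in enumerate(doc_profile))

def guess_language(text, lang_profiles):
    # Pick the language whose profile is closest to the text's profile.
    doc_profile = ngram_profile(text)
    return min(lang_profiles,
               key=lambda lang: out_of_place(doc_profile, lang_profiles[lang]))

# Usage (training texts would normally be much larger):
profiles = {"en": ngram_profile("the quick brown fox jumps over the lazy dog"),
            "de": ngram_profile("der schnelle braune fuchs springt ueber den faulen hund")}
print(guess_language("the dog jumps", profiles))  # likely "en"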

筱武穆 2024-12-15 19:25:30

You don't have to do a deep analysis of the text to get an idea of what language it's in. Statistics tells us that every language has specific character patterns and frequencies. That's a pretty good first-order approximation. It gets worse when the text is in multiple languages, but it's still not something extremely complex.
Of course, if the text is too short (e.g. a single word, or worse, a single short word), statistics doesn't work and you need a dictionary.
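As a rough sketch of that first-order approximation (my own, using tiny stand-in reference texts rather than real per-language statistics), one could compare plain character-frequency vectors by cosine similarity:

import math
from collections import Counter

def char_freqs(text):
    # Relative frequency of each alphabetic character in the text.
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()} if total else {}

def cosine(a, b):
    # Cosine similarity between two sparse frequency vectors.
    dot = sum(a[c] * b.get(c, 0.0) for c in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Tiny stand-in reference texts; real references would be large corpora.
references = {
    "English": char_freqs("the quick brown fox jumps over the lazy dog and runs away"),
    "Italian": char_freqs("la volpe veloce salta sopra il cane pigro e corre via"),
}

def guess(text):
    freqs = char_freqs(text)
    return max(references, key=lambda lang: cosine(freqs, references[lang]))

print(guess("hello there my friend"))  # likely English

For a single short word, as the answer notes, such statistics break down and a dictionary lookup is the better tool.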

鸩远一方 2024-12-15 19:25:30

An implementation example.

Mathematica is a good fit for implementing this. It recognizes (i.e. it has dictionaries for) words in the following languages:

dicts = DictionaryLookup[All]
{"Arabic", "BrazilianPortuguese", "Breton", "BritishEnglish", \
"Catalan", "Croatian", "Danish", "Dutch", "English", "Esperanto", \
"Faroese", "Finnish", "French", "Galician", "German", "Hebrew", \
"Hindi", "Hungarian", "IrishGaelic", "Italian", "Latin", "Polish", \
"Portuguese", "Russian", "ScottishGaelic", "Spanish", "Swedish"}

I built a little, naive function to calculate the probability of a sentence being in each of those languages:

(* Split the text into words, look each word up in every dictionary, tally
   which languages matched, and sort the languages by the fraction of the
   text's words found in each. *)
f[text_] := 
 SortBy[{#[[1]], #[[2]] / Length@k} & /@ (Tally@(First /@ 
       Flatten[DictionaryLookup[{All, #}] & /@ (k = 
           StringSplit[text]), 1])), -#[[2]] &]

So, just by looking words up in dictionaries, you can get a good approximation, even for short sentences:

f["we the people"]
{{BritishEnglish,1},{English,1},{Polish,2/3},{Dutch,1/3},{Latin,1/3}}

f["sino yo triste y cuitado que vivo en esta prisión"]
{{Spanish,1},{Portuguese,7/10},{Galician,3/5},... }

f["wszyscy ludzie rodzą się wolni"]
{{"Polish", 3/5}}

f["deutsch lernen mit jetzt"]
{{"German", 1}, {"Croatian", 1/4}, {"Danish", 1/4}, ...}

同尘 2024-12-15 19:25:30

You might be interested in 'The WiLI benchmark dataset for written language identification'. The high-level answer, which you can also find in the paper, is the following (a minimal sketch follows the list):

  • Clean the text: remove things you don't want or need; make the Unicode unambiguous by applying a normal form.
  • Feature extraction: count n-grams and create tf-idf features, or something like that.
  • Train a classifier on the features: neural networks, SVMs, Naive Bayes, ... whatever you think could work.
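A minimal sketch of that pipeline, assuming scikit-learn is available, could look like this; the tiny training set, the (1, 3) character n-gram range, and the Naive Bayes classifier are placeholder choices of mine, and a real model would be trained on a corpus such as WiLI:

import unicodedata
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def clean(text):
    # Step 1: lowercase and apply a Unicode normal form (NFKC here).
    return unicodedata.normalize("NFKC", text.lower())

# Toy training data; far too small for real use.
train_texts = ["the quick brown fox jumps over the lazy dog",
               "hello how are you doing today",
               "el rapido zorro marron salta sobre el perro perezoso",
               "hola como estas el dia de hoy"]
train_langs = ["en", "en", "es", "es"]

# Steps 2 and 3: character n-gram tf-idf features fed into a classifier.
model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
    MultinomialNB(),
)
model.fit([clean(t) for t in train_texts], train_langs)

print(model.predict([clean("como estas amigo")]))  # expected: ['es']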