... part of the Azure Cognitive Services API collection of machine
learning and AI algorithms in the cloud, and is readily consumable in
your development projects.
Here's a quickstart guide on how to detect language from text using this API.
This package implements several algorithms for language
identification, and includes two sets of pre-compiled language
profiles. One set covers 52 languages and was trained on Wikipedia
(i.e. a well-written corpus); the other covers 26 languages and was
constructed from Twitter (i.e. a highly colloquial corpus). The
language identifiers are packaged as a C# library and can be easily
embedded into other C# projects.
var text = "¿Dónde está el baño?";
// Ask the Google AJAX Language API to detect the language of the text.
google.language.detect(text, function(result) {
    if (!result.error) {
        // Map the returned language code back to its language name.
        var language = 'unknown';
        for (var l in google.language.Languages) {
            if (google.language.Languages[l] == result.language) {
                language = l;
                break;
            }
        }
        var container = document.getElementById("detection");
        container.innerHTML = text + " is: " + language;
    }
});
And, since you are using C#, take a look at this article on how to call the API from C#.
UPDATE:
That C# link is gone; here's a cached copy of the core of it:
string s = TextBoxTranslateEnglishToHebrew.Text;
string key = "YOUR GOOGLE AJAX API KEY";
// Detect the language of the input first.
GoogleLangaugeDetector detector =
    new GoogleLangaugeDetector(s, VERSION.ONE_POINT_ZERO, key);
// If the input was detected as Hebrew ("iw"), translate Hebrew to English;
// otherwise translate English to Hebrew.
GoogleTranslator gTranslator = new GoogleTranslator(s, VERSION.ONE_POINT_ZERO,
    detector.LanguageDetected.Equals("iw") ? LANGUAGE.HEBREW : LANGUAGE.ENGLISH,
    detector.LanguageDetected.Equals("iw") ? LANGUAGE.ENGLISH : LANGUAGE.HEBREW,
    key);
TextBoxTranslation.Text = gTranslator.Translation;
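One alternative is to use the 'Translator Text API', which is part of the Azure Cognitive Services API collection of machine learning and AI algorithms in the cloud, and is readily consumable in your development projects.
Here's a quickstart guide on how to detect language from text using this API.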
You may use the C# package for language identification from Microsoft Research:
Download the package from the above link.
We can use Regex.IsMatch(text, "[\\uxxxx-\\uxxxx]+") to detect a specific language by the Unicode range of its script. Here xxxx is the 4-digit hexadecimal Unicode code point of a character.
To detect Arabic:
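A minimal sketch of that check for Arabic, assuming the basic Arabic Unicode block (U+0600 to U+06FF) is the range of interest. The class and method names are hypothetical. Note that a character range identifies the script rather than the language, so Persian or Urdu text would match as well.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ScriptDetector
{
    // U+0600..U+06FF is the basic Arabic block; this range is an assumption
    // of the sketch and ignores the supplementary Arabic blocks.
    private static readonly Regex ArabicChars = new Regex(@"[\u0600-\u06FF]+");

    // Returns true when the text contains at least one character
    // from the basic Arabic block.
    public static bool ContainsArabic(string text)
    {
        return ArabicChars.IsMatch(text);
    }
}
```

Calling ScriptDetector.ContainsArabic(text) on an input string tells you only whether any Arabic-script character is present; mixed-script text would need a count or ratio instead.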
CLD3 (Compact Language Detector v3) library from Google's Chromium browser
You could wrap the CLD3 library, which is written in C++.
Make a statistical analysis of the string: split the string into words, get a dictionary for every language you want to test for, and then pick the language whose dictionary matches the most words.
In C#, every string in memory is Unicode (UTF-16), so the string itself carries no trace of an original encoding. Text files generally do not store their encoding either (sometimes there is only an indication of 8-bit or 16-bit).
If you only need to distinguish between two languages, you might find some simple tricks. For example, if you want to tell English from Dutch, a string that contains the letter "y" is most likely English (unreliable but fast).
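The word-count idea can be sketched as follows. The class name, the two language codes, and the tiny word lists are all assumptions made for illustration; a usable detector would need real dictionaries with thousands of entries per language.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordListDetector
{
    // Tiny illustrative word lists (hypothetical); real dictionaries are far larger.
    private static readonly Dictionary<string, HashSet<string>> Words =
        new Dictionary<string, HashSet<string>>
        {
            { "en", new HashSet<string> { "the", "and", "is", "of", "you", "where" } },
            { "nl", new HashSet<string> { "de", "het", "en", "een", "is", "waar" } },
        };

    private static readonly char[] Separators =
        { ' ', '\t', '\n', '\r', '.', ',', '!', '?', ';', ':' };

    // Returns the code of the language whose word list matches the most
    // words in the text, or null when no word matches at all.
    public static string Guess(string text)
    {
        string[] tokens = text.ToLowerInvariant()
            .Split(Separators, StringSplitOptions.RemoveEmptyEntries);

        string best = null;
        int bestCount = 0;
        foreach (var entry in Words)
        {
            int count = tokens.Count(t => entry.Value.Contains(t));
            if (count > bestCount)
            {
                best = entry.Key;
                bestCount = count;
            }
        }
        return best;
    }
}
```

Words shared between languages (here "is") count for both, which is one reason this approach becomes unreliable on very short input.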
If you mean the natural (i.e. human) language, this is in general a Hard Problem. What language is "server": English or Turkish? What language is "chat": English or French? What language is "uno": Italian or Spanish (or Latin!)?
Without paying attention to context and doing some hard natural language processing (<----- this is the phrase to google for), you haven't got a chance.
You might enjoy a look at Frengly: it's a nice UI on top of the Google Translate service, which attempts to guess the language of the input text...
A statistical approach using digraphs or trigraphs is a very good indicator. For example, here are the most common digraphs in English in order: http://www.letterfrequency.org/#digraph-frequency (one can find better or more complete lists). This method may have a better success rate than word analysis for short snippets of text because there are more digraphs in text than there are complete words.
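A rough sketch of digraph scoring, under stated assumptions: the class name is hypothetical, and the two ten-entry digraph lists are short hand-picked examples for English and German rather than measured frequency tables. The score is simply how many letter pairs in the text appear in each list.

```csharp
using System;
using System.Collections.Generic;

public static class DigraphDetector
{
    // Hand-picked illustrative digraph lists (assumed, not measured data).
    private static readonly Dictionary<string, HashSet<string>> CommonDigraphs =
        new Dictionary<string, HashSet<string>>
        {
            { "en", new HashSet<string> { "th", "he", "in", "er", "an", "re", "on", "at", "en", "nd" } },
            { "de", new HashSet<string> { "en", "er", "ch", "de", "ei", "nd", "te", "in", "ie", "ge" } },
        };

    // Returns the code of the language whose list matches the most
    // letter pairs in the text, or null when nothing matches.
    public static string Guess(string text)
    {
        string lower = text.ToLowerInvariant();
        string best = null;
        int bestScore = 0;
        foreach (var profile in CommonDigraphs)
        {
            int score = 0;
            for (int i = 0; i + 1 < lower.Length; i++)
            {
                // Only count pairs of adjacent letters, so digraphs
                // never span spaces or punctuation.
                if (char.IsLetter(lower[i]) && char.IsLetter(lower[i + 1])
                    && profile.Value.Contains(lower.Substring(i, 2)))
                {
                    score++;
                }
            }
            if (score > bestScore)
            {
                best = profile.Key;
                bestScore = score;
            }
        }
        return best;
    }
}
```

A more serious version would weight each digraph by its relative frequency and compare ranked profiles instead of counting set membership.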
Fast answer: NTextCat (NuGet, Online Demo)
Long answer:
Currently the best way seems to be to use classifiers trained to assign a piece of text to one (or more) languages from a predefined set.
There is a Perl tool called TextCat. It has language models for the 74 most popular languages. There are a huge number of ports of this tool to different programming languages.
There were no ports to .NET, so I wrote one: NTextCat on GitHub.
It is a pure .NET Framework DLL plus a command-line interface to it. By default, it uses a profile of 14 languages.
Any feedback is much appreciated!
New ideas and feature requests are welcome too :)
An alternative is to use one of the numerous online services (e.g. the Google one already mentioned, detectlanguage.com, langid.net, etc.).
If your code runs in a context that has internet access, you can try the Google API for language detection.
http://code.google.com/apis/ajaxlanguage/documentation/
And, since you are using C#, take a look at this article on how to call the API from C#.
UPDATE:
That C# link is gone; here's a cached copy of the core of it:
Basically, you need to create a URI and send it to Google that looks like:
This tells the API that you want to translate "hello world" from English to Hebrew, to which Google's JSON response would look like:
I chose to make a base class that represents a typical Google JSON response:
Then, a Translation object that inherits from this class:
This Translation class has a TranslationResponseData object that looks like this:
Finally, we can make the GoogleTranslator class: