Detect language from a string in PHP
In PHP, is there a way to detect the language of a string? Assume the string is UTF-8 encoded.
Comments (19)
I used this method to check for characters outside English, Spanish, and French, using strictly PHP (5.1 or later) without any extra language API or classes. The list of scripts comes from: https://www.php.net/manual/en/regexp.reference.unicode.php
An improvement would be for PHP to add a function that lists all supported scripts, so that you don't have to fill in the array by hand.
The use case was blocking non-Latin posts to a form to improve its spam blocking, as the form was receiving a lot of Russian, Chinese, and Arabic spam posts. Since this was implemented, spam has gone from 40,000 posts per week to fewer than 5, with none in the last 3 weeks. Google reCAPTCHA was in use but was being defeated easily. #satisfied
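A minimal sketch of this kind of script check, assuming PCRE Unicode script properties are available. The function name `containsBlockedScript` and the default script list are hypothetical, chosen for illustration, and are not the poster's actual array.

```php
<?php
// Hypothetical helper: true when $text contains at least one character
// from any of the listed Unicode scripts (names as used by PCRE's \p{...}).
function containsBlockedScript(
    string $text,
    array $scripts = ['Cyrillic', 'Han', 'Hiragana', 'Katakana', 'Hangul', 'Arabic', 'Hebrew', 'Thai']
): bool {
    foreach ($scripts as $script) {
        // The /u modifier makes PCRE treat the subject as UTF-8.
        // preg_match returns false on invalid UTF-8; that counts as "no match" here.
        if (preg_match('/\p{' . $script . '}/u', $text) === 1) {
            return true;
        }
    }
    return false;
}
```

Accented Latin letters (as in Spanish or French) belong to the Latin script, so they pass this check; only the scripts in the list are rejected.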
You could implement a module with Apache Tika in Java, write the results to a text file, a database, etc., and then read them from PHP.
If you don't have that much content, you could use Google's API, but keep in mind that your calls will be limited and you can only send a restricted number of characters to the API. At the time of writing I had finished testing version 1 of the API (which turned out to be not very accurate) and the Labs version 2 (which I ditched after reading that there is a cap of 100,000 characters per day).
The following code doesn't need any API or huge dependencies. In this code we remove all symbols, HTML tags (in case you are working with HTML), HTML entities, and spaces.
With the remaining text, we compare the number of English characters with the number of non-English characters. If the number of English characters is greater than the number of non-English characters, we mark the string as English.
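The steps described above could be sketched roughly as follows. This is an illustration under assumptions, not the poster's code: the name `isMostlyEnglish` is hypothetical, and "English characters" is taken to mean the ASCII letters A-Z and a-z.

```php
<?php
// Hypothetical helper: true when ASCII letters outnumber all other letters.
function isMostlyEnglish(string $text): bool
{
    $text = strip_tags($text);                                   // drop HTML tags
    $text = preg_replace('/&#?[A-Za-z0-9]+;/', '', $text) ?? ''; // drop HTML entities
    // Drop symbols, digits, punctuation, and whitespace: keep letters only.
    $letters = preg_replace('/[^\p{L}]+/u', '', $text);
    if ($letters === null || $letters === '') {
        return false; // nothing left to judge (or invalid UTF-8)
    }
    $english = (int) preg_match_all('/[A-Za-z]/', $letters);
    $total   = (int) preg_match_all('/\p{L}/u', $letters);
    return $english > ($total - $english);
}
```

Note that this only separates Latin-alphabet text from other alphabets; French or German text would also be marked as English by this test.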
I've used the Text_LanguageDetect PEAR package with some reasonable results. It's dead simple to use, and it has a modest database of 52 languages. The downside is that it does not detect East Asian languages.
results in:
I know this is an old post, but here is what I developed after not finding any viable solution.
The solution takes the 20 most common words of each language and counts their occurrences in the haystack. It then compares the counts of the two highest-scoring languages. If the runner-up's count is less than 10% of the winner's, the winner takes it all.
Code - Any suggestions for speed improvement are more than welcome!
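A rough sketch of the common-word idea, under stated assumptions: the function name `detectByCommonWords` is hypothetical, the word lists are supplied by the caller (shortened lists are shown in the usage below), and returning `null` when the 10% rule is not met is a choice made for this sketch.

```php
<?php
// Hypothetical helper: $wordLists maps a language code to its common words (lowercase).
function detectByCommonWords(string $text, array $wordLists): ?string
{
    $lower = function_exists('mb_strtolower') ? mb_strtolower($text, 'UTF-8') : strtolower($text);
    // Split on anything that is not a letter; drop empty pieces.
    $words = preg_split('/[^\p{L}]+/u', $lower, -1, PREG_SPLIT_NO_EMPTY);
    if ($words === false) {
        return null; // invalid UTF-8
    }
    $scores = [];
    foreach ($wordLists as $lang => $common) {
        // array_intersect keeps duplicates from $words, so this counts occurrences.
        $scores[$lang] = count(array_intersect($words, $common));
    }
    arsort($scores); // highest count first, keys preserved
    $langs  = array_keys($scores);
    $counts = array_values($scores);
    if ($counts === [] || $counts[0] === 0) {
        return null; // no common word found at all
    }
    $runnerUp = $counts[1] ?? 0;
    // Winner takes all only if the runner-up has less than 10% of the winner's count.
    return ($runnerUp < 0.1 * $counts[0]) ? $langs[0] : null;
}

// Usage with shortened, illustrative word lists (the answer describes 20 per language).
$lists = [
    'en' => ['the', 'of', 'and', 'to', 'a', 'in', 'is', 'that', 'it', 'was'],
    'es' => ['de', 'la', 'que', 'el', 'en', 'y', 'a', 'los', 'se', 'del'],
    'fr' => ['de', 'la', 'le', 'et', 'les', 'des', 'en', 'un', 'du', 'une'],
];
$lang = detectByCommonWords('The cat is in the garden and it was happy that the sun was out', $lists);
```

Short inputs that share words across languages (for example "la casa de la", which scores equally for Spanish and French with these lists) come back as `null` rather than a guess.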
You cannot detect the language from the character type, and there is no foolproof way to do this.
With any method, you're just making an educated guess. There are some math-related articles available on the subject.
You could do this entirely client-side with Google's AJAX Language API (now defunct). You can automatically detect a string's language and translate any string written in one of the supported languages (also defunct).
As the Google Translate API is closing down as a free service, you can try this free alternative, which is a replacement for the Google Translate API:
http://detectlanguage.com
The Text_LanguageDetect PEAR package produced terrible results: "luxury apartments downtown" is detected as Portuguese...
The Google API is still the best solution; they give $300 of free credit and warn you before charging anything.
Below is a super simple function that uses file_get_contents to download the language detected by the API, so there is no need to download or install libraries, etc.
Execute:
You can get your Google Translate API key here: https://console.cloud.google.com/apis/library/translate.googleapis.com/
This is a simple example for short phrases to get you going. For more complex applications you'll obviously want to restrict your API key and use the library.
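A sketch of what such a function might look like, assuming the Cloud Translation v2 REST `detect` endpoint and its `data.detections` response shape; check both against the current Google documentation before relying on them. The function names are hypothetical, and URL building and response parsing are split out so they can be tried without a network call or a key.

```php
<?php
// Build the request URL for the (assumed) v2 detect endpoint.
function buildDetectUrl(string $text, string $apiKey): string
{
    return 'https://translation.googleapis.com/language/translate/v2/detect?'
        . http_build_query(['key' => $apiKey, 'q' => $text]);
}

// Pull the first detected language code out of a v2-style JSON response body.
function parseDetectedLanguage(string $json): ?string
{
    $data = json_decode($json, true);
    return $data['data']['detections'][0][0]['language'] ?? null;
}

// Fetch and parse in one step; returns null on any network or parsing failure.
function detectLanguage(string $text, string $apiKey): ?string
{
    $response = @file_get_contents(buildDetectUrl($text, $apiKey));
    return $response === false ? null : parseDetectedLanguage($response);
}
```

A call would look like `detectLanguage('luxury apartments downtown', $yourApiKey)`, where `$yourApiKey` is a placeholder for your own key.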
I tried the Text_LanguageDetect library and the results were not very good (for instance, the text "test" was identified as Estonian rather than English).
I can recommend trying the Yandex Translate API, which is free for 1 million characters per 24 hours and up to 10 million characters per month. As of May 27, 2020, free API keys are no longer issued.
It supports (according to the documentation) over 60 languages.
You can probably use the Google Translate API to detect the language and translate it if necessary.
You can see how to detect the language of a string in PHP using the Text_LanguageDetect PEAR package, or download it to use separately like a regular PHP library.
I have had good results with https://github.com/patrickschur/language-detection and am using it in production:
My usage: I am analyzing emails for a CRM system to determine what language each email was written in, so sending the text to a third-party service was not an option. Even though the Universal Declaration of Human Rights is probably not the best basis for categorizing the language of emails (emails often have formulaic parts like greetings, which are not part of the declaration), it identifies the correct language in about 99% of cases if the text contains at least 5 words.
Update: I managed to improve language recognition in emails to basically 100% by using the language-detection library with the following methods:
These do make the library a bit slower, so I would suggest using them asynchronously if possible and measuring the performance. In my case it is more than fast enough and much more accurate.
One approach might be to break the input string into words and then look those words up in an English dictionary to see how many of them are present. This approach has a few limitations:
Perhaps submit the string to this language guesser:
http://www.xrce.xerox.com/competencies/content-analysis/tools/guesser
I would take documents in various languages and reference them against Unicode. You could then use some Bayesian reasoning to determine the language just from the Unicode characters used. This would separate French from English or Russian.
I am not sure exactly what else could be done, except looking up the words in language dictionaries to determine the language (using a similar probabilistic approach).
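One way to read the "Bayesian reasoning" suggestion is a naive Bayes classifier over individual letters. The sketch below is an interpretation, not something from the answer: the function names and the tiny training samples are hypothetical, a uniform prior over languages is assumed, and add-one smoothing handles unseen letters.

```php
<?php
// Count letter frequencies per language from sample texts.
function trainCharModel(array $samples): array
{
    $model = [];
    foreach ($samples as $lang => $text) {
        preg_match_all('/\p{L}/u', $text, $m);
        $model[$lang] = array_count_values($m[0]);
    }
    return $model;
}

// Pick the language with the highest summed log-probability of the letters in $text.
function classifyByChars(string $text, array $model): ?string
{
    preg_match_all('/\p{L}/u', $text, $m);
    if ($m[0] === []) {
        return null; // no letters to classify
    }
    $vocab = [];
    foreach ($model as $counts) {
        $vocab += $counts; // union of all letters seen in training
    }
    $v = count($vocab) + 1; // one extra bucket for never-seen letters
    $best = null;
    $bestScore = -INF;
    foreach ($model as $lang => $counts) {
        $total = array_sum($counts);
        $score = 0.0; // uniform prior, so it is left out of the sum
        foreach ($m[0] as $ch) {
            $score += log((($counts[$ch] ?? 0) + 1) / ($total + $v)); // add-one smoothing
        }
        if ($score > $bestScore) {
            $bestScore = $score;
            $best = $lang;
        }
    }
    return $best;
}

// Tiny illustrative samples; a real model would need far more text per language.
$model = trainCharModel([
    'en' => 'the quick brown fox jumps over the lazy dog and runs away',
    'fr' => 'le garçon a mangé une pomme très sucrée près de la forêt où il était',
    'ru' => 'съешь ещё этих мягких французских булок да выпей чаю',
]);
$guess = classifyByChars('привет мир', $model);
```

With samples this small, languages with different alphabets separate cleanly, while languages sharing an alphabet (English and French here) need much larger training texts to be told apart reliably.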
Try using the ASCII encoding.
I use that code to determine ru/en languages in my social bot project.
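A minimal sketch of an ASCII-based ru/en check. The function name `detectRuOrEn` and the exact rules are assumptions made for illustration: pure-ASCII text containing letters is treated as English, any Cyrillic character as Russian, and everything else as undecided.

```php
<?php
// Hypothetical helper: returns 'ru', 'en', or null when neither rule applies.
function detectRuOrEn(string $text): ?string
{
    if (preg_match('/\p{Cyrillic}/u', $text) === 1) {
        return 'ru';
    }
    // Every byte in the ASCII range, and at least one Latin letter present.
    if (preg_match('/^[\x00-\x7F]*$/', $text) === 1 && preg_match('/[A-Za-z]/', $text) === 1) {
        return 'en';
    }
    return null;
}
```

This only works when the input is known to be one of the two languages; it cannot tell Russian from Bulgarian or English from any other language written in plain ASCII.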
Additional words for French and Spanish, to add to Swiss Mister's answer:
My answer is for a specific case.
Here is what I wrote to find out whether a string is in a specific language, with one condition: the languages involved must use different alphabets.
In my case the word(s) can be in 3 languages: English, Bulgarian, and Greek (each with a different alphabet). I need to find out whether a text is in Bulgarian, so that I can later translate it to Greek.
Hope this helps someone with a similar case to mine.
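For the three-alphabet case described above, one possible sketch counts letters per Unicode script and picks the most frequent one. The names `dominantScript` and `isBulgarian` are hypothetical, and treating "mostly Cyrillic" as Bulgarian only holds because the input is assumed to be English, Bulgarian, or Greek.

```php
<?php
// Hypothetical helper: returns the script with the most letters in $text, or null if none.
function dominantScript(string $text, array $scripts = ['Latin', 'Cyrillic', 'Greek']): ?string
{
    $best = null;
    $bestCount = 0;
    foreach ($scripts as $script) {
        $count = (int) preg_match_all('/\p{' . $script . '}/u', $text);
        if ($count > $bestCount) {
            $bestCount = $count;
            $best = $script;
        }
    }
    return $best;
}

// Under the three-language assumption, Cyrillic text is taken to be Bulgarian.
function isBulgarian(string $text): bool
{
    return dominantScript($text) === 'Cyrillic';
}
```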