Detecting the language of a string in PHP

Posted on 2024-08-04 21:39:58

In PHP, is there a way to detect the language of a string? Suppose the string is in UTF-8 format.

Comments (19)

呆萌少年 2024-08-11 21:39:59

I used this method to detect characters that are not English, Spanish, or French using only PHP (5.1 or later), without any extra language API or classes. The list of script names comes from https://www.php.net/manual/en/regexp.reference.unicode.php. See below.

An improvement would be a PHP function that lists all supported scripts, so that you don't have to fill in the array by hand.

The use case was blocking non-Latin posts to a form to improve its spam blocking, as the form was receiving a lot of Russian, Chinese, and Arabic spam posts. Since this was implemented, spam has gone from 40,000 posts per week to fewer than 5, with none in the last 3 weeks. Google reCAPTCHA was in use, but it was being defeated easily. #satisfied

<?php
$non_latin_text = "This is NOT english, spanish, or french (which are latin languages) because it has this char in it:  и";
$latin_text = "1234567890-=\][poiuytrewqasdfghjkl;'/.,mnbvcxz!@#$%^&*()_+|}{:\"?><QWERTYUIOPLKJHGFDSAZXCVBNM";

// var_dump() is used because print_r(FALSE) prints nothing at all
var_dump(is_non_latin($non_latin_text)); // bool(true)
var_dump(is_non_latin($latin_text));     // bool(false)

function is_non_latin($text)
{
   $text_script_languages = get_language_scripts($text);

   // Digits and punctuation belong to the Common script, letters to Latin.
   // Any other script found in the text means it is not purely Latin.
   // An empty string contains no scripts and is therefore not flagged.
   $other_scripts = array_diff($text_script_languages, array('Common', 'Latin'));

   return count($other_scripts) > 0;
}

function get_language_scripts($text)
{
   $scripts = array('Arabic', 'Armenian', 'Avestan', 'Balinese', 'Bamum', 'Batak', 'Bengali', 'Bopomofo', 'Brahmi', 'Braille', 'Buginese', 'Buhid', 'Canadian_Aboriginal', 'Carian', 'Chakma', 'Cham', 'Cherokee', 'Common', 'Coptic', 'Cuneiform', 'Cypriot', 'Cyrillic', 'Deseret', 'Devanagari', 'Egyptian_Hieroglyphs', 'Ethiopic', 'Georgian', 'Glagolitic', 'Gothic', 'Greek', 'Gujarati', 'Gurmukhi', 'Han', 'Hangul', 'Hanunoo', 'Hebrew', 'Hiragana', 'Imperial_Aramaic', 'Inherited', 'Inscriptional_Pahlavi', 'Inscriptional_Parthian', 'Javanese', 'Kaithi', 'Kannada', 'Katakana', 'Kayah_Li', 'Kharoshthi', 'Khmer', 'Lao', 'Latin', 'Lepcha', 'Limbu', 'Linear_B', 'Lisu', 'Lycian', 'Lydian', 'Malayalam', 'Mandaic', 'Meetei_Mayek', 'Meroitic_Cursive', 'Meroitic_Hieroglyphs', 'Miao', 'Mongolian', 'Myanmar', 'New_Tai_Lue', 'Nko', 'Ogham', 'Old_Italic', 'Old_Persian', 'Old_South_Arabian', 'Old_Turkic', 'Ol_Chiki', 'Oriya', 'Osmanya', 'Phags_Pa', 'Phoenician', 'Rejang', 'Runic', 'Samaritan', 'Saurashtra', 'Sharada', 'Shavian', 'Sinhala', 'Sora_Sompeng', 'Sundanese', 'Syloti_Nagri', 'Syriac', 'Tagalog', 'Tagbanwa', 'Tai_Le', 'Tai_Tham', 'Tai_Viet', 'Takri', 'Tamil', 'Telugu', 'Thaana', 'Thai', 'Tibetan', 'Tifinagh', 'Ugaritic', 'Vai', 'Yi');
 
   $found_scripts = array();

   foreach ($scripts AS $key => $script)
   {
      if (!empty($script))
      {
         if (preg_match( '/[\p{'.$script.'}]/u', $text))
         {
            $found_scripts[] = $script;
         }
      }
   }

   return $found_scripts;
}
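
The loop above runs one regular expression per script name. A more compact variant of the same idea is a single negated character class. The following is only a sketch and not part of the original answer: the function name `contains_non_latin` is hypothetical, it assumes PHP 7 or later with Unicode-enabled PCRE, and it additionally allows the `Inherited` script so that combining accents are not flagged.

```php
<?php
// Hypothetical helper: true when $text contains any character whose script
// is not Latin, Common (digits, punctuation, spaces) or Inherited
// (combining marks).
function contains_non_latin(string $text): bool
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    $result = preg_match('/[^\p{Latin}\p{Common}\p{Inherited}]/u', $text);

    if ($result === false) {
        // preg_match() fails on invalid UTF-8; treat such input as suspicious.
        return true;
    }

    return $result === 1;
}

var_dump(contains_non_latin("Hello, world! 123")); // bool(false)
var_dump(contains_non_latin("Привет, world"));     // bool(true)
```

Unlike the array-based version, this does not report which scripts were found, so it only fits the yes/no spam-filter use case.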
瀞厅☆埖开 2024-08-11 21:39:59

You could build a module around Apache Tika in Java, write the results to a text file, a database, etc., and then read them from that file or database with PHP.
If you don't have that much content, you could use Google's API, but keep in mind that your calls will be limited and that you can only send a restricted number of characters to the API. At the time of writing I had finished testing version 1 of the API (which turned out to be not very accurate) and the Labs version 2 (which I ditched after reading that there is a cap of 100,000 characters per day).

寄居人 2024-08-11 21:39:59

The following code doesn't need any API or huge dependencies. It removes all symbols, HTML tags (in case you are working with HTML), HTML entities, and spaces.

In the remaining text, it compares the number of English characters with the number of non-English characters. If there are more English characters than non-English ones, the string is marked as English.

function is_english($string)
{
    // Removing html tags
    $string = strip_tags($string);

    // Removing html entities
    $string = preg_replace('/&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});/i', '', $string);

    // Removing symbols, punctuation and spaces: only letters and digits stay.
    // The /u modifier makes PCRE read the string as UTF-8, so multibyte
    // characters are handled whole instead of byte by byte.
    $string = (string) preg_replace('/[^\p{L}\p{N}]+/u', '', $string);

    // Counting english characters (ASCII letters and digits)
    $english_char = preg_match_all('/[a-z0-9]/i', $string);

    // Counting non english characters (every other letter or digit)
    $non_english_char = preg_match_all('/[^a-z0-9]/iu', $string);

    // Checks if number of english characters is greater
    return $english_char > $non_english_char;
}

离旧人 2024-08-11 21:39:58

I've used the Text_LanguageDetect PEAR package with reasonable results. It's dead simple to use, and it has a modest database of 52 languages. The downside is that it does not detect East Asian languages.

require_once 'Text/LanguageDetect.php';
$l = new Text_LanguageDetect();
$result = $l->detect($text, 4);
if (PEAR::isError($result)) {
    echo $result->getMessage();
} else {
    print_r($result);
}

results in:

Array
(
    [german] => 0.407037037037
    [dutch] => 0.288065843621
    [english] => 0.283333333333
    [danish] => 0.234526748971
)
多像笑话 2024-08-11 21:39:58

I know this is an old post, but here is what I developed after not finding any viable solution.

  • The other suggestions are all too heavy and too cumbersome for my situation.
  • I support a finite number of languages on my website (two at the moment, 'en' and 'de', but the solution generalises to more).
  • I need a plausible guess about the language of a user-generated string, and I have a fallback (the user's language setting).
  • So I want a solution with minimal false positives, and I don't care so much about false negatives.

The solution takes the 20 most common words of each language and counts their occurrences in the haystack. It then compares the counts of the two highest-scoring languages. If the runner-up's count is less than 10% of the winner's, the winner takes it all; otherwise the default is returned.

Code (any suggestions for speed improvements are more than welcome!):

    function getTextLanguage($text, $default) {
      $supported_languages = array(
          'en',
          'de',
      );
      // German word list
      // from http://wortschatz.uni-leipzig.de/Papers/top100de.txt
      $wordList['de'] = array ('der', 'die', 'und', 'in', 'den', 'von', 
          'zu', 'das', 'mit', 'sich', 'des', 'auf', 'für', 'ist', 'im', 
          'dem', 'nicht', 'ein', 'Die', 'eine');
      // English word list
      // from http://en.wikipedia.org/wiki/Most_common_words_in_English
      $wordList['en'] = array ('the', 'be', 'to', 'of', 'and', 'a', 'in', 
          'that', 'have', 'I', 'it', 'for', 'not', 'on', 'with', 'he', 
          'as', 'you', 'do', 'at');
      // French word list
      // from https://1000mostcommonwords.com/1000-most-common-french-words/
      $wordList['fr'] = array ('comme', 'que',  'tait',  'pour',  'sur',  'sont',  'avec',
                         'tre',  'un',  'ce',  'par',  'mais',  'que',  'est',
                         'il',  'eu',  'la', 'et', 'dans', 'mot');

      // Spanish word list
      // from https://spanishforyourjob.com/commonwords/
      $wordList['es'] = array ('que', 'no', 'a', 'la', 'el', 'es', 'y',
                         'en', 'lo', 'un', 'por', 'qu', 'si', 'una',
                         'los', 'con', 'para', 'est', 'eso', 'las');
      // clean out the input string - only ASCII letters survive, so a word
      // list entry containing non-ASCII characters (such as 'für') can
      // never match... change this if your language wordlists need them!
      // The padding spaces let words at the very start and end match too.
      $text = ' ' . preg_replace("/[^A-Za-z]/", ' ', $text) . ' ';
      // count the occurrences of the most frequent words
      foreach ($supported_languages as $language) {
        $counter[$language]=0;
      }
      for ($i = 0; $i < 20; $i++) {
        foreach ($supported_languages as $language) {
          $counter[$language] = $counter[$language] + 
            // I believe this is way faster than fancy RegEx solutions
            substr_count($text, ' ' .$wordList[$language][$i] . ' ');
        }
      }
      // get max counter value
      // from http://stackoverflow.com/a/1461363
      $max = max($counter);
      $maxs = array_keys($counter, $max);
      // if there are two winners - fall back to default!
      if (count($maxs) == 1) {
        $winner = $maxs[0];
        $second = 0;
        // get runner-up (second place)
        foreach ($supported_languages as $language) {
          if ($language <> $winner) {
            if ($counter[$language]>$second) {
              $second = $counter[$language];
            }
          }
        }
        // apply arbitrary threshold of 10%; the $max > 0 check avoids a
        // division by zero when none of the words were found
        if ($max > 0 && ($second / $max) < 0.1) {
          return $winner;
        } 
      }
      return $default;
    }
野の 2024-08-11 21:39:58

You cannot detect the language from the character type, and there is no foolproof way to do this.

With any method, you are just making an educated guess. There are some math-related articles on the subject out there.

挖鼻大婶 2024-08-11 21:39:58

You could do this entirely client side with Google's AJAX Language API (now defunct).

With the AJAX Language API, you can translate and detect the language of blocks of text within a webpage using only Javascript. In addition, you can enable transliteration on any textfield or textarea in your web page. For example, if you were transliterating to Hindi, this API will allow users to phonetically spell out Hindi words using English and have them appear in the Hindi script.

You can automatically detect a string's language:

var text = "¿Dónde está el baño?";
google.language.detect(text, function(result) {
  if (!result.error) {
    var language = 'unknown';
    for (l in google.language.Languages) {
      if (google.language.Languages[l] == result.language) {
        language = l;
        break;
      }
    }
    var container = document.getElementById("detection");
    container.innerHTML = text + " is: " + language + "";
  }
});

And you can translate any string written in one of the supported languages (also defunct):

google.language.translate("Hello world", "en", "es", function(result) {
  if (!result.error) {
    var container = document.getElementById("translation");
    container.innerHTML = result.translation;
  }
});
往事随风而去 2024-08-11 21:39:58

As the Google Translate API is closing down as a free service, you can try this free alternative, which is a replacement for the Google Translate API:

http://detectlanguage.com

裂开嘴轻声笑有多痛 2024-08-11 21:39:58

The Text_LanguageDetect PEAR package produced terrible results: "luxury apartments downtown" is detected as Portuguese...

The Google API is still the best solution; they give $300 of free credit and warn you before charging anything.

Below is a super simple function that uses file_get_contents to fetch the language detected by the API, so there is no need to download or install any libraries.

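function guess_lang($str) {

    // urlencode() escapes spaces and every other reserved character in the query
    $str = urlencode($str);

    $content = file_get_contents("https://translation.googleapis.com/language/translate/v2/detect?key=YOUR_API_KEY&q=".$str);

    // file_get_contents() returns false when the request fails
    if ($content === false)
        return null;

    $lang = json_decode($content, true);

    if (isset($lang["data"]["detections"][0][0]["language"]))
        return $lang["data"]["detections"][0][0]["language"];

    return null;
 }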

Execute:

echo guess_lang("luxury apartments downtown montreal"); // returns "en"

You can get your Google Translate API key here: https://console.cloud.google.com/apis/library/translate.googleapis.com/

This is a simple example for short phrases to get you going. For more complex applications you will obviously want to restrict your API key and use the library.

与君绝 2024-08-11 21:39:58

I tried the Text_LanguageDetect library and the results I got were not very good (for instance, the text "test" was identified as Estonian and not English).

I can recommend trying the Yandex Translate API, which is free for up to 1 million characters per 24 hours and up to 10 million characters a month. Note that as of May 27, 2020, free API keys are no longer issued.
According to the documentation, it supports over 60 languages.

<?php
function identifyLanguage($text)
{
    $baseUrl = "https://translate.yandex.net/api/v1.5/tr.json/detect?key=YOUR_API_KEY";
    $url = $baseUrl . "&text=" . urlencode($text);

    $ch = curl_init($url);

    curl_setopt($ch, CURLOPT_CAINFO, YOUR_CERT_PEM_FILE_LOCATION);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, TRUE);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);

    $output = curl_exec($ch);
    if ($output)
    {
        $outputJson = json_decode($output);
        if ($outputJson->code == 200)
        {
            if (strlen($outputJson->lang) > 0)
            {
                return $outputJson->lang;
            }
        }
    }
    
    return "unknown";
}

function translateText($text, $targetLang)
{
    $baseUrl = "https://translate.yandex.net/api/v1.5/tr.json/translate?key=YOUR_API_KEY";
    $url = $baseUrl . "&text=" . urlencode($text) . "&lang=" . urlencode($targetLang);

    $ch = curl_init($url);

    curl_setopt($ch, CURLOPT_CAINFO, YOUR_CERT_PEM_FILE_LOCATION);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, TRUE);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);

    $output = curl_exec($ch);
    if ($output)
    {
        $outputJson = json_decode($output);
        if ($outputJson->code == 200)
        {
            if (count($outputJson->text) > 0 && strlen($outputJson->text[0]) > 0)
            {
                return $outputJson->text[0];
            }
        }
    }
    
    return $text;
}

header("content-type: text/html; charset=UTF-8");

echo identifyLanguage("エクスペリエンス");
echo "<br>";
echo translateText("エクスペリエンス", "en");
echo "<br>";
echo translateText("エクスペリエンス", "es");
echo "<br>";
echo translateText("エクスペリエンス", "zh");
echo "<br>";
echo translateText("エクスペリエンス", "he");
echo "<br>";
echo translateText("エクスペリエンス", "ja");
echo "<br>";
?>
不知在何时 2024-08-11 21:39:58

You can probably use the Google Translate API to detect the language and translate it if necessary.

情未る 2024-08-11 21:39:58

You can see how to detect the language of a string in PHP using the Text_LanguageDetect PEAR package, either installed through PEAR or downloaded and used separately like a regular PHP library.

孤凫 2024-08-11 21:39:58

I have had good results with https://github.com/patrickschur/language-detection and am using it in production:

  • It uses n-grams to detect the most likely language (the longer your string and the more words it has, the more accurate the result), which seems like a solid, proven method.
  • 110 languages are supported, but you can also limit detection to only the languages you are interested in.
  • The trainer and the language detector can easily be improved or customized. The library uses the Universal Declaration of Human Rights in each language as the basis for detection, but if you know what type of sentences you will encounter, you can easily extend or replace the texts used for each language and get better results fast. "Training" this library to become better is easy.
  • I would suggest increasing setMaxNgrams (I set it to 9000) in the Trainer, running it once, and then using the same setting in the Language detector class. Changing the n-gram count is a bit unintuitive (I had to look through the code to find out how it works), which is a drawback, and the default (310) is too low in my opinion. More n-grams make the guessing a lot better.
  • Because the library is very small, it was relatively easy to understand what is happening and how to tweak it.

My usage: I am analyzing emails for a CRM system to find out what language an email was written in, so sending the text to a third-party service was not an option. Even though the Universal Declaration of Human Rights is probably not the best basis for categorizing the language of emails (emails often have formulaic parts such as greetings, which are not part of the Declaration), it identifies the correct language in about 99% of cases, provided the text has at least 5 words.

Update: I managed to improve language recognition in emails to basically 100% when using the language-detection library with the following methods:

  • Add additional common phrases to the (relevant) language samples, like "Greetings", "Best regards", "Sincerely". These kinds of expressions are not used in the Universal Declaration of Human Rights. If you are analyzing human communication, commonly used phrases help the language recognition a lot, especially formulaic ones that people use often ("Hello", "Have a nice day").
  • Set the maximum n-gram length to 4 (instead of the default 3).
  • Keep maxNgrams at 9000 as before.

These settings do make the library a bit slower, so I would suggest using them asynchronously if possible and measuring the performance. In my case it is more than fast enough and much more accurate.
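
To illustrate the general n-gram idea described above, here is a self-contained sketch. It is not the library's code or API: the function names (`ngram_profile`, `profile_distance`, `detect_language`), the rank-distance scoring, and the penalty value are assumptions made for illustration, and the two training texts are just Article 1 of the Universal Declaration of Human Rights. It assumes PHP 7.1 or later with the mbstring extension.

```php
<?php
// Hypothetical helper: rank the most frequent character n-grams of a text.
function ngram_profile(string $text, int $n = 3, int $maxNgrams = 300): array
{
    // Lowercase, then turn every run of non-letters into a single space.
    $clean = (string) preg_replace('/[^\p{L}]+/u', ' ', mb_strtolower($text, 'UTF-8'));
    $clean = ' ' . trim($clean) . ' ';

    $counts = [];
    $length = mb_strlen($clean, 'UTF-8');
    for ($i = 0; $i + $n <= $length; $i++) {
        $gram = mb_substr($clean, $i, $n, 'UTF-8');
        $counts[$gram] = ($counts[$gram] ?? 0) + 1;
    }

    // Most frequent first, then keep the top $maxNgrams as gram => rank.
    arsort($counts);
    return array_flip(array_slice(array_keys($counts), 0, $maxNgrams));
}

// Sum of rank differences; n-grams missing from the reference cost $penalty.
function profile_distance(array $sample, array $reference, int $penalty = 300): int
{
    $distance = 0;
    foreach ($sample as $gram => $rank) {
        $distance += isset($reference[$gram])
            ? abs($rank - $reference[$gram])
            : $penalty;
    }
    return $distance;
}

// Returns the key of the closest training text, or null for empty input.
function detect_language(string $text, array $trainingTexts): ?string
{
    $sample = ngram_profile($text);
    if (count($sample) === 0) {
        return null;
    }

    $best = null;
    $bestDistance = PHP_INT_MAX;
    foreach ($trainingTexts as $language => $trainingText) {
        $distance = profile_distance($sample, ngram_profile($trainingText));
        if ($distance < $bestDistance) {
            $bestDistance = $distance;
            $best = $language;
        }
    }
    return $best;
}

// Tiny illustrative training set; real use needs far more text per language.
$trainingTexts = [
    'en' => 'All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.',
    'de' => 'Alle Menschen sind frei und gleich an Würde und Rechten geboren. Sie sind mit Vernunft und Gewissen begabt und sollen einander im Geiste der Brüderlichkeit begegnen.',
];

echo detect_language('They should act in a spirit of brotherhood', $trainingTexts), "\n";
echo detect_language('Alle Menschen sind frei und gleich', $trainingTexts), "\n";
```

With training texts this small the sketch only works reliably on input that resembles them, which is the same reason the answer recommends adding typical phrases to the samples and raising the number of n-grams.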

烟雨凡馨 2024-08-11 21:39:58

One approach might be to break the input string into words and then look up those words in an English dictionary to see how many of them are present. This approach has a few limitations:

  • proper nouns may not be handled well
  • spelling errors can disrupt your lookups
  • abbreviations like "lol" or "b4" won't necessarily be in the dictionary
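
The dictionary lookup described above could be sketched roughly as follows. The function name `dictionary_hit_ratio` and the ten-word dictionary are placeholders for illustration; a real word list would be loaded from a file. The sketch assumes PHP 7 or later with the mbstring extension.

```php
<?php
// Hypothetical helper: share of the words in $text that occur in $dictionary.
function dictionary_hit_ratio(string $text, array $dictionary): float
{
    // Split on everything except letters, digits and apostrophes.
    $words = preg_split(
        '/[^\p{L}\p{N}\']+/u',
        mb_strtolower($text, 'UTF-8'),
        -1,
        PREG_SPLIT_NO_EMPTY
    );
    if ($words === false || count($words) === 0) {
        return 0.0;
    }

    // Flip the list so each lookup is a hash access instead of a scan.
    $lookup = array_flip($dictionary);

    $hits = 0;
    foreach ($words as $word) {
        if (isset($lookup[$word])) {
            $hits++;
        }
    }
    return $hits / count($words);
}

// Placeholder dictionary; a real one would hold many thousands of words.
$dictionary = ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog', 'is', 'a'];

var_dump(dictionary_hit_ratio('The quick brown fox', $dictionary)); // float(1)
var_dump(dictionary_hit_ratio('lol b4 the dog', $dictionary));      // float(0.5)
```

The second call shows the limitation listed above: "lol" and "b4" are not in the dictionary, so they pull the ratio down even though the text is English. The caller still has to pick a threshold for what counts as English.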
无畏 2024-08-11 21:39:58

Perhaps submit the string to this language guesser:

http://www.xrce.xerox.com/competencies/content-analysis/tools/guesser

吃颗糖壮壮胆 2024-08-11 21:39:58

I would take documents in various languages and record which Unicode characters they use. You could then use some Bayesian reasoning to determine the language from just the Unicode characters used. This would separate French from English or Russian.

I am not sure what else could be done, except looking up the words in language dictionaries to determine the language (using a similar probabilistic approach).
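
A minimal sketch of that Bayesian idea, assuming a naive Bayes model over single characters with uniform priors and add-one smoothing. The function names and the tiny training texts (Article 1 of the Universal Declaration of Human Rights) are illustrative assumptions, not anything the answer specifies. It assumes PHP 7.1 or later with the mbstring extension.

```php
<?php
// Count how often each letter occurs in a text (case-insensitive).
function char_counts(string $text): array
{
    $letters = (string) preg_replace('/[^\p{L}]+/u', '', mb_strtolower($text, 'UTF-8'));
    $chars = preg_split('//u', $letters, -1, PREG_SPLIT_NO_EMPTY);

    $counts = [];
    foreach ($chars ?: [] as $ch) {
        $counts[$ch] = ($counts[$ch] ?? 0) + 1;
    }
    return $counts;
}

// Naive Bayes over characters: pick the language whose character
// distribution gives the text the highest log-likelihood.
function classify_by_characters(string $text, array $trainingTexts): ?string
{
    $sample = char_counts($text);
    if (count($sample) === 0) {
        return null; // nothing to classify
    }

    $models = array_map('char_counts', $trainingTexts);

    // Vocabulary = every character seen in any training text,
    // plus one bucket for characters never seen at all.
    $vocabulary = [];
    foreach ($models as $counts) {
        $vocabulary += $counts;
    }
    $vocabularySize = count($vocabulary) + 1;

    $best = null;
    $bestScore = -INF;
    foreach ($models as $language => $counts) {
        $total = array_sum($counts);
        $score = 0.0;
        foreach ($sample as $ch => $occurrences) {
            // Add-one smoothing keeps unseen characters from zeroing the product.
            $probability = (($counts[$ch] ?? 0) + 1) / ($total + $vocabularySize);
            $score += $occurrences * log($probability);
        }
        if ($score > $bestScore) {
            $bestScore = $score;
            $best = $language;
        }
    }
    return $best;
}

// Tiny illustrative training set; real use needs much larger samples.
$trainingTexts = [
    'en' => 'All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.',
    'ru' => 'Все люди рождаются свободными и равными в своем достоинстве и правах. Они наделены разумом и совестью и должны поступать в отношении друг друга в духе братства.',
];

echo classify_by_characters('Привет, как дела?', $trainingTexts), "\n";
echo classify_by_characters('Hello, how are you?', $trainingTexts), "\n";
```

Single-character statistics separate languages with different scripts well. For languages that share an alphabet, such as French and English, the signal is much weaker and much more training text is needed.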

红玫瑰 2024-08-11 21:39:58

Try checking the byte (ASCII) codes of the characters.
I use this code to tell Russian from English in my social bot project:

function language($string) {
        // 208 and 209 are the first bytes (0xD0, 0xD1) of UTF-8 encoded Cyrillic letters;
        // the longer entries are two-byte pairs and never match a single byte value.
        $ru = array("208","209","208176","208177","208178","208179","208180","208181","209145","208182","208183","208184","208185","208186","208187","208188","208189","208190","208191","209128","209129","209130","209131","209132","209133","209134","209135","209136","209137","209138","209139","209140","209141","209142","209143");
        // Byte values of the lowercase Latin letters a-z.
        $en = array("97","98","99","100","101","102","103","104","105","106","107","108","109","110","111","112","113","114","115","116","117","118","119","120","121","122");
        // HTML entities first, then the literal characters (entities assumed lost in the original listing).
        $htmlcharacters = array("&lt;", "&gt;", "&amp;", "<", ">", "&");
        $string = str_replace($htmlcharacters, "", $string);
        //Strip out the slashes
        $string = stripslashes($string);
        $badthings = array("=", "#", "~", "!", "?", ".", ",", "<", ">", "/", ";", ":", '"', "'", "[", "]", "{", "}", "@", "$", "%", "^", "&", "*", "(", ")", "-", "_", "+", "|", "`");
        $string = str_replace($badthings, "", $string);
        $string = mb_strtolower($string);
        $msgarray = explode(" ", $string);
        $words = count($msgarray);
        // The original called a ToAscii() helper that the answer never defines;
        // ord() on the first byte of the first word is assumed to be equivalent.
        $letters = (string) ord($msgarray[0]);
        if (in_array($letters, $ru)) {
            $result = 'Русский'; //russian
        } elseif (in_array($letters, $en)) {
            $result = 'Английский'; //english
        } else {
            $result = 'ошибка' . $letters; //error
        }
        return $result;
}
兲鉂ぱ嘚淚 2024-08-11 21:39:58

Additional words for French and Spanish to Swiss Mister's answer:

    // French word list
    // from https://1000mostcommonwords.com/1000-most-common-french-words/
    $wordList['fr'] = array ('comme', 'que',  'était',  'pour',  'sur',  'sont',  'avec',
                             'être',  'à',  'un',  'ce',  'par',  'mais',  'que',  'est',
                             'il',  'eu',  'la', 'et', 'dans');

    // Spanish word list
    // from https://spanishforyourjob.com/commonwords/
    $wordList['es'] = array ('que', 'no', 'a', 'la', 'el', 'es', 'y',
                             'en', 'lo', 'un', 'por', 'qué', 'si', 'una',
                             'los', 'con', 'para', 'está', 'eso', 'las');
计㈡愣 2024-08-11 21:39:58

My answer is for a specific case.
Here is what I wrote to find out whether a string is in a specific language, with one condition: the languages involved must use different alphabets.
In my case the word(s) can be in 3 languages: English, Bulgarian and Greek (each with a different alphabet). I need to find out whether a text is in Bulgarian, so that I can later translate it to Greek.

class Language {
        protected $bgSymbols = array(
            'а', 'б', 'в', 'г', 'д', 'е', 'ж', 'з', 'и', 'й', 'к', 'л', 'м', 'н', 'о', 'п', 'р', 'с', 'т', 'у', 'ф', 'х', 'ц', 'ъ', 'ь', 'ч', 'щ', 'ш', 'ю', 'я',
            'А', 'Б', 'В', 'Г', 'Д', 'Е', 'Ж', 'З', 'И', 'Й', 'К', 'Л', 'М', 'Н', 'О', 'П', 'Р', 'С', 'Т', 'У', 'Ф', 'Х', 'Ц', 'Ъ', 'Ь', 'Ч', 'Щ', 'Ш', 'Ю', 'Я'
        );
        
        public function checkIfForTranslate($string) {
            $result = false;
            $stringArray = array();
            preg_match_all('/./u', $string, $matches);
            if(isset($matches[0])) {
                $stringArray = $matches[0];
            }
            foreach($this->bgSymbols as $symbol) {
                $found = array_search($symbol, $stringArray);
                if($found !== false) {
                    $result = true;
                    break;
                }
            }
            return $result;
        }
    }

Hope this helps someone with a case similar to mine.
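
A shorter variant of the same check, assuming PCRE is built with Unicode property support, is to match the Cyrillic script directly. The helper name `contains_cyrillic` is hypothetical and not part of the class above.

```php
<?php
// Hypothetical helper: true if $text contains at least one Cyrillic letter.
function contains_cyrillic(string $text): bool
{
    return preg_match('/\p{Cyrillic}/u', $text) === 1;
}

var_dump(contains_cyrillic('Добър ден'));    // Bulgarian sample
var_dump(contains_cyrillic('Καλημέρα'));     // Greek sample
var_dump(contains_cyrillic('Good morning')); // English sample
```

Unlike the explicit letter list, this matches every Cyrillic letter, including ones Bulgarian does not use, so it only fits when Bulgarian is the sole Cyrillic language expected in the input.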
