Detecting the language of a PHP string

Posted 2024-08-04 21:39:58 | Word count: 43 | Views: 11 | Comments: 0

In PHP, is there a way to detect the language of a string? Assume the string is UTF-8 encoded.

呆萌少年 2024-08-11 21:39:59

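I have used this method to check for characters that fall outside English, Spanish, and French, using only PHP (5.1 or later) with no extra language APIs or classes. Strictly speaking it detects Unicode scripts (writing systems) rather than languages. The list of script names comes from https://www.php.net/manual/en/regexp.reference.unicode.php; see the code below.

One improvement would be for PHP to provide a function that lists all supported scripts, so that the array would not have to be filled in by hand.

The use case was blocking non-Latin posts to a form to improve its spam filtering, because the form was receiving a lot of Russian, Chinese, and Arabic spam. Since this was implemented, spam has dropped from 40,000 posts per week to fewer than 5, with none in the last 3 weeks. Google reCAPTCHA was in use, but it was being defeated easily. #satisfied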

<?php
$non_latin_text = "This is NOT english, spanish, or french (which are latin languages) because it has this char in it:  и";
$latin_text = "1234567890-=\][poiuytrewqasdfghjkl;'/.,mnbvcxz!@#$%^&*()_+|}{:\"?><QWERTYUIOPLKJHGFDSAZXCVBNM";

// var_dump() is used instead of print_r() because print_r() prints "1" for TRUE and nothing for FALSE.
var_dump(is_non_latin($non_latin_text)); // bool(true)
var_dump(is_non_latin($latin_text));     // bool(false)

function is_non_latin($text)
{
   $text_script_languages = get_language_scripts($text);

   // 'Common' (digits, punctuation, whitespace) and 'Latin' are allowed.
   // Any other detected script marks the text as non-Latin.
   // An empty string contains no scripts at all and is treated as Latin.
   $other_scripts = array_diff($text_script_languages, array('Common', 'Latin'));

   return count($other_scripts) > 0;
}

function get_language_scripts($text)
{
   $scripts = array('Arabic', 'Armenian', 'Avestan', 'Balinese', 'Bamum', 'Batak', 'Bengali', 'Bopomofo', 'Brahmi', 'Braille', 'Buginese', 'Buhid', 'Canadian_Aboriginal', 'Carian', 'Chakma', 'Cham', 'Cherokee', 'Common', 'Coptic', 'Cuneiform', 'Cypriot', 'Cyrillic', 'Deseret', 'Devanagari', 'Egyptian_Hieroglyphs', 'Ethiopic', 'Georgian', 'Glagolitic', 'Gothic', 'Greek', 'Gujarati', 'Gurmukhi', 'Han', 'Hangul', 'Hanunoo', 'Hebrew', 'Hiragana', 'Imperial_Aramaic', 'Inherited', 'Inscriptional_Pahlavi', 'Inscriptional_Parthian', 'Javanese', 'Kaithi', 'Kannada', 'Katakana', 'Kayah_Li', 'Kharoshthi', 'Khmer', 'Lao', 'Latin', 'Lepcha', 'Limbu', 'Linear_B', 'Lisu', 'Lycian', 'Lydian', 'Malayalam', 'Mandaic', 'Meetei_Mayek', 'Meroitic_Cursive', 'Meroitic_Hieroglyphs', 'Miao', 'Mongolian', 'Myanmar', 'New_Tai_Lue', 'Nko', 'Ogham', 'Old_Italic', 'Old_Persian', 'Old_South_Arabian', 'Old_Turkic', 'Ol_Chiki', 'Oriya', 'Osmanya', 'Phags_Pa', 'Phoenician', 'Rejang', 'Runic', 'Samaritan', 'Saurashtra', 'Sharada', 'Shavian', 'Sinhala', 'Sora_Sompeng', 'Sundanese', 'Syloti_Nagri', 'Syriac', 'Tagalog', 'Tagbanwa', 'Tai_Le', 'Tai_Tham', 'Tai_Viet', 'Takri', 'Tamil', 'Telugu', 'Thaana', 'Thai', 'Tibetan', 'Tifinagh', 'Ugaritic', 'Vai', 'Yi');
 
   $found_scripts = array();

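   foreach ($scripts as $script)
   {
      // \p{Name} matches any character of that Unicode script.
      // The /u modifier makes PCRE treat the pattern and the subject as UTF-8.
      if (preg_match('/\p{' . $script . '}/u', $text))
      {
         $found_scripts[] = $script;
      }
   }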

   return $found_scripts;
}
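
For the Latin-only spam check specifically, the hand-filled script array may not be needed at all. The following is a minimal sketch, not the original poster's code: `contains_non_latin_script()`, `is_acceptable_post()`, and the `message` field name are hypothetical, and the type declarations assume PHP 7 or later. It uses a single negated character class, so it is stricter than the loop above: it flags any character that is neither Latin nor Common, including characters from scripts that are missing from the hand-written list.

```php
<?php
// Hypothetical helper (not from the original answer): returns true when
// $text contains at least one character outside the Latin and Common scripts.
function contains_non_latin_script(string $text): bool
{
    // [^\p{Latin}\p{Common}] matches any character that is neither Latin nor Common.
    // The /u modifier makes PCRE treat the pattern and the subject as UTF-8.
    $result = preg_match('/[^\p{Latin}\p{Common}]/u', $text);

    // preg_match() returns false on error, for example when $text is not valid UTF-8.
    // Treating that case as "non-Latin" is an assumption that suits spam filtering.
    if ($result === false) {
        return true;
    }

    return $result === 1;
}

// Hypothetical form check: 'message' is an assumed field name.
function is_acceptable_post(array $form): bool
{
    $message = isset($form['message']) ? (string) $form['message'] : '';

    return !contains_non_latin_script($message);
}

var_dump(is_acceptable_post(array('message' => 'Hello, world! 123'))); // bool(true)
var_dump(is_acceptable_post(array('message' => 'Привет, это спам')));  // bool(false)
```

Combining marks such as U+0301 belong to the Inherited script, so accented letters stored in decomposed form would be flagged by this sketch and by the loop above. Adding `\p{Inherited}` to the character class would allow them, if that matters for your input.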
