Identifying text as Simplified vs. Traditional Chinese

Posted 09-30 01:08 · 48 characters · 7 views · 0 comments

Given a block of text that is known to be Chinese and encoded as UTF-8, is there a way to determine whether it is Simplified or Traditional?


君勿笑 2024-10-07 01:08:09

I don't know if this will work, but I'd try using iconv to check whether the text converts cleanly to each charset, comparing the results of the same conversion with //TRANSLIT and with //IGNORE. If the two results match, the conversion did not encounter any character it could not represent in the target charset, so the text fits that charset.

// $text is assumed to hold the UTF-8 input.
// iconv() returns false on failure, so compare strictly and rule that out.
$test1 = iconv("UTF-8", "big5//TRANSLIT", $text);
$test2 = iconv("UTF-8", "big5//IGNORE", $text);
if ($test1 !== false && $test1 === $test2) {
   echo 'traditional';
} else {
   $test3 = iconv("UTF-8", "gb2312//TRANSLIT", $text);
   $test4 = iconv("UTF-8", "gb2312//IGNORE", $text);
   if ($test3 !== false && $test3 === $test4) {
      echo 'simplified';
   } else {
      echo 'Failed to match either traditional or simplified';
   }
}
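
One consequence of checking Big5 first is worth spelling out: characters such as 中 and 人 exist in both Big5 and GB2312, so text built only from shared characters converts cleanly to both and is reported as traditional. Below is a minimal sketch of the same exact-match idea that tests both charsets before deciding. The helper names `convertsCleanly`, `labelFromChecks` and `classifyByExactMatch`, and the 'ambiguous' label, are illustrative assumptions and not part of the answer above. The result still depends on how the local iconv build handles //TRANSLIT and //IGNORE.

```php
<?php
// Hypothetical helper: true when the TRANSLIT and IGNORE conversions agree,
// i.e. iconv neither substituted nor dropped any character.
function convertsCleanly(string $text, string $charset): bool
{
    // @ silences the notice iconv may raise for unconvertible input.
    $translit = @iconv("UTF-8", $charset . "//TRANSLIT", $text);
    $ignore   = @iconv("UTF-8", $charset . "//IGNORE", $text);
    return $translit !== false && $translit === $ignore;
}

// Pure decision rule, kept separate so it can be checked without iconv.
function labelFromChecks(bool $big5Clean, bool $gb2312Clean): string
{
    if ($big5Clean && $gb2312Clean) {
        return 'ambiguous'; // only characters shared by both charsets
    }
    if ($big5Clean) {
        return 'traditional';
    }
    if ($gb2312Clean) {
        return 'simplified';
    }
    return 'unknown';
}

// Hypothetical entry point combining the two helpers above.
function classifyByExactMatch(string $text): string
{
    return labelFromChecks(
        convertsCleanly($text, "big5"),
        convertsCleanly($text, "gb2312")
    );
}
```

Calling `classifyByExactMatch($text)` returns one of the four labels instead of echoing, which makes the ambiguous case visible to the caller.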

梦幻的味道 2024-10-07 01:08:09

Since Big5 and GB2312 omit quite a few commonly used variants that are present in Unicode, code that relies on an exact match between the TRANSLIT and IGNORE results fails in many normal cases. For example, it would not identify 説話 as Traditional Chinese, even though 説 is a common Hong Kong variant of 說, the form that Big5 uses.

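A simple fix is to do it in a fuzzy way: convert to each charset with //IGNORE, then pick the charset that keeps more of the text, provided it keeps more than 80% of the original characters: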

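// Returns a best-guess label for the UTF-8 string $text.
function detectChineseVariant($text) {
    // //IGNORE drops characters the target charset cannot represent.
    $test1 = iconv("UTF-8", "big5//IGNORE", $text);
    $test2 = iconv("UTF-8", "gb2312//IGNORE", $text);
    // Count characters in each result's own encoding, not in UTF-8;
    // a failed conversion (false) counts as zero surviving characters.
    $len1 = ($test1 === false) ? 0 : mb_strlen($test1, "BIG-5");
    $len2 = ($test2 === false) ? 0 : mb_strlen($test2, "EUC-CN");
    $len0 = mb_strlen($text, "UTF-8") * 0.8; // threshold
    if ($len1 > $len2 && $len1 > $len0) {
        return 'Likely Traditional';
    }
    if ($len2 > $len1 && $len2 > $len0) {
        return 'Likely Simplified';
    }
    return 'Could not identify';
}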
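
To see how the threshold behaves, the decision rule can be separated from the conversion step. The sketch below takes character counts directly. The function name `classifyBySurvival` and its parameters are hypothetical, introduced only for illustration, and the 0.8 default mirrors the threshold used above.

```php
<?php
// Hypothetical helper illustrating the fuzzy rule from the answer above.
// $total:      characters in the original UTF-8 text
// $big5Kept:   characters that survived conversion to Big5
// $gb2312Kept: characters that survived conversion to GB2312
function classifyBySurvival(int $total, int $big5Kept, int $gb2312Kept, float $ratio = 0.8): string
{
    $threshold = $total * $ratio;
    if ($big5Kept > $gb2312Kept && $big5Kept > $threshold) {
        return 'Likely Traditional';
    }
    if ($gb2312Kept > $big5Kept && $gb2312Kept > $threshold) {
        return 'Likely Simplified';
    }
    return 'Could not identify';
}

echo classifyBySurvival(100, 97, 60), "\n"; // Likely Traditional
echo classifyBySurvival(100, 55, 99), "\n"; // Likely Simplified
echo classifyBySurvival(100, 90, 90), "\n"; // Could not identify (tie)
echo classifyBySurvival(100, 70, 40), "\n"; // Could not identify (below 80%)
```

The tie case matters in practice: text made up only of characters that both charsets share survives both conversions equally, so the rule reports it as unidentified.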
