Checking for valid string encoding with PHP's intl (ICU) features

Posted on 2024-11-18 21:40:05

Using the features currently available in PHP's intl wrapper for ICU, how would you go about checking for validity of a string's encoding? (e.g. check for valid UTF-8)

I know it can be done with mbstring, iconv() and PCRE but I'm specifically interested in intl with this question.

Comments (2)

幸福丶如此 2024-11-25 21:40:05

UConverter is available as of PHP 5.5. It had no manual page when this was written; see https://wiki.php.net/rfc/uconverter for the API.

function replace_invalid_byte_sequence($str)
{
    return UConverter::transcode($str, 'UTF-8', 'UTF-8');
}

function replace_invalid_byte_sequence2($str)
{
    return (new UConverter('UTF-8', 'UTF-8'))->convert($str);
}

function utf8_check_encoding($str)
{
    return $str === UConverter::transcode($str, 'UTF-8', 'UTF-8');
}

function utf8_check_encoding2($str)
{
    return $str === (new UConverter('UTF-8', 'UTF-8'))->convert($str);
}

// Table 3-8, "Use of U+FFFD in UTF-8 Conversion"
// (http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf)
$str =  "\x61"."\xF1\x80\x80"."\xE1\x80"."\xC2"."\x62"."\x80"."\x63"
    ."\x80"."\xBF"."\x64";
$expected = 'a���b�c��d';

var_dump([
    $expected === replace_invalid_byte_sequence($str),
    $expected === replace_invalid_byte_sequence2($str)
],[
    false === utf8_check_encoding($str),
    false === utf8_check_encoding2($str)
]);
画尸师 2024-11-25 21:40:05

I did some digging and found the ICU unorm2_normalize() documentation. Its pErrorCode out parameter is interesting. The standard ICU error codes start around line 620 of utypes.h. So I tried this test script:

$s = 'tête-à-tête';
echo "normalizer_normalize(\$s) >> " 
     . var_export(normalizer_normalize($s), 1) . "\n";
$s = "\xFF" . $s;
echo "normalizer_normalize(\$s) >> " 
     . var_export($r=normalizer_normalize($s), 1) . "\n";
if ($r===false)
    echo "normalizer_normalize() error: " 
         . intl_get_error_message() . "\n";
// which outputs:
// normalizer_normalize($s) >> 'tête-à-tête'
// normalizer_normalize($s) >> false
// normalizer_normalize() error: Error converting input string to UTF-16: U_INVALID_CHAR_FOUND

So I guess a test based on that, looking for the following three error codes, would be a decent indication of bad UTF-8 encoding:

U_INVALID_CHAR_FOUND Character conversion: Unmappable input sequence.
U_TRUNCATED_CHAR_FOUND Character conversion: Incomplete input sequence.
U_ILLEGAL_CHAR_FOUND Character conversion: Illegal input sequence/combination of input units.

Or when I'm feeling lazy I could just use

normalizer_normalize($s)===false
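
For what it's worth, the stricter variant (only treating the three conversion errors above as "bad UTF-8") could be sketched roughly like this. The helper name is_valid_utf8_intl() is made up for illustration and is not part of intl; the sketch assumes PHP 7 or later with the intl extension loaded, and it throws on any other failure rather than guessing.

```php
<?php
// Sketch only: is_valid_utf8_intl() is a hypothetical helper, not an intl function.
// It relies on normalizer_normalize() returning false when the input cannot be
// converted to UTF-16, as observed in the test script above.
function is_valid_utf8_intl(string $s): bool
{
    if (normalizer_normalize($s) !== false) {
        return true; // conversion and normalization succeeded
    }

    // ICU conversion errors that indicate malformed input.
    $conversionErrors = [
        U_INVALID_CHAR_FOUND,
        U_TRUNCATED_CHAR_FOUND,
        U_ILLEGAL_CHAR_FOUND,
    ];

    if (in_array(intl_get_error_code(), $conversionErrors, true)) {
        return false; // failure was caused by the encoding of $s
    }

    // Some other failure (not an encoding problem): surface it instead of hiding it.
    throw new RuntimeException(
        'normalizer_normalize() failed: ' . intl_get_error_message()
    );
}

var_dump(is_valid_utf8_intl("t\xC3\xAAte-\xC3\xA0-t\xC3\xAAte")); // valid UTF-8 bytes for 'tête-à-tête'
var_dump(is_valid_utf8_intl("\xFF" . "t\xC3\xAAte"));             // leading 0xFF is never valid in UTF-8
```

The lazy one-liner is shorter, but it cannot tell an encoding failure apart from any other reason normalizer_normalize() might return false.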

Btw: I'm confused by this line of the ICU API spec:

pErrorCode: Standard ICU error code. Its input value must pass the U_SUCCESS() test, or else the function returns immediately. Check for U_FAILURE() on output or use with function chaining. (See User Guide for details.)

The phrase "the function returns immediately" is encouraging regarding the performance of my test, but does "the function" refer to unorm2_normalize() or to U_SUCCESS()? Any ideas?
