How to detect a decoded string

Posted 2024-11-29 00:48:58

I'm chasing a bug in Perl code that seems to fundamentally be a version of this:

"Cannot decode string with wide characters" appears on a weird place

Basically, under certain conditions, Encode::decode('utf8', $string) is getting called twice on the same string, and hilarity ensues. Now, the best solution is to figure out what conditions are causing the double-decode and stop that from happening. Unfortunately, this is mature production code for a feature-rich product; figuring out those conditions and fixing them in a way that doesn't introduce regressions looks to be challenging.

Is there some fast, reliable way to detect whether a string has already been decoded from UTF-8? Inserting "if" statements before those calls feels a tad kludgy, but ought to be a pretty safe fix.
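For readers who want to see the failure in isolation, here is a minimal reproduction sketch. The variable names and sample strings are illustrative assumptions, not taken from the production code. It decodes two UTF-8 byte strings once, then decodes the results a second time: as far as I know, Encode croaks when the already-decoded string holds a character above 255, and quietly returns different text when all characters fit in a byte.

```perl
use strict;
use warnings;
use Encode ();

# UTF-8 octets for U+2603 (SNOWMAN) and for U+00E9 (e with acute).
my $snowman_bytes = "\xE2\x98\x83";
my $eacute_bytes  = "\xC3\xA9";

# First decode: octets become characters, as intended.
my $snowman = Encode::decode('utf8', $snowman_bytes);
my $eacute  = Encode::decode('utf8', $eacute_bytes);

# Second decode of a string holding a character above 255.
# eval traps the exception so the script can report it.
my $twice_snowman = eval { Encode::decode('utf8', $snowman) };
my $error = $@;

# Second decode of a string whose characters all fit in a byte:
# no exception is expected here, but the text does not survive.
my $twice_eacute = Encode::decode('utf8', $eacute);

printf "first decode:  %d character(s)\n", length $snowman;
print  "second decode died: $error" if $error;
print  "e-acute survived a second decode: ",
       ($twice_eacute eq $eacute ? "yes" : "no"), "\n";
```

The second case is the nastier one in practice, because nothing dies and the corruption only shows up later in the output.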

Comments (2)

白首有我共你 2024-12-06 00:48:58

It's impossible to correctly detect whether a scalar contains a decoded string or not. There's no way to communicate that info to Perl, so there's no way for it to communicate it to you. At best, one can guess. There are some heuristics you could use. From most reliable to least:

  1. If the string contains characters above 255, it's not encoded. This is exactly what causes the "wide character" warning/error.

    utf8::encode($s) if $s =~ /[^\x00-\xFF]/;
    
  2. If the encoding in question is UTF-8 and the scalar contains valid UTF-8, it's probably encoded.

  3. If the encoding in question is UTF-8 and the scalar does not contain valid UTF-8, it's probably decoded.

    utf8::encode($s) if !utf8::decode(my $tmp = $s);
    
  4. If the scalar's UTF8 flag is on, then the string is probably decoded.

  5. If the scalar's UTF8 flag is off, then the string is probably not decoded.

    utf8::encode($s) if utf8::is_utf8($s);
    

You should decode all your inputs and encode all your outputs.
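As a rough sketch of how the first three heuristics could be folded into the "if" guard the question asks about: `decode_utf8_once` below is a hypothetical helper name, not something Encode provides. It leaves the string alone when it already looks decoded and otherwise decodes it with the core `utf8::decode` function. It is still a guess, in line with the caveat above, not a reliable test.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Encode or any module): return a decoded
# copy of $s, or $s itself if it already looks decoded.
sub decode_utf8_once {
    my ($s) = @_;
    return $s unless defined $s;

    # Heuristic 1: a character above 255 cannot be a byte,
    # so the string must already be decoded.
    return $s if $s =~ /[^\x00-\xFF]/;

    # Heuristics 2 and 3: try to decode a copy. utf8::decode returns
    # false when the content is not well-formed UTF-8, in which case
    # we assume the string was decoded earlier and return it unchanged.
    my $copy = $s;
    return $s unless utf8::decode($copy);

    return $copy;
}

my $bytes = "\xE2\x98\x83";              # UTF-8 octets for U+2603
my $once  = decode_utf8_once($bytes);
my $twice = decode_utf8_once($once);     # simulates the double call

print "one character after first call: ", (length($once) == 1 ? "yes" : "no"), "\n";
print "second call left it unchanged:  ", ($twice eq $once ? "yes" : "no"), "\n";
```

The guess can still go wrong: decoded text whose characters happen to form valid UTF-8 when read as bytes (for example the two characters "\xC3\xA9") will be decoded again. That is why fixing the double decode at its source remains the better long-term option.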

哎呦我呸! 2024-12-06 00:48:58

Encode has an is_utf8 function:

is_utf8(STRING [, CHECK])

[INTERNAL] Tests whether the UTF8 flag is turned on in the STRING.
If CHECK is true, also checks the data in STRING for being well-formed
UTF-8. Returns true if successful, false otherwise.

Note that this function is documented under the heading "Messing with Perl's Internals", so it might change in future Perl versions.
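A small sketch of what the flag does and does not tell you, using illustrative sample strings of my own. The flag describes how Perl stores the string internally, so it can be on for a string that was never decoded, which is consistent with the first answer ranking this heuristic as the least reliable.

```perl
use strict;
use warnings;
use Encode ();

# UTF-8 octets for U+2603, straight from a literal: the flag is off.
my $bytes = "\xE2\x98\x83";

# The same data after decoding: the flag is on.
my $chars = Encode::decode('utf8', $bytes);

# A one-character string that nobody decoded. utf8::upgrade only changes
# the internal storage; the string still compares equal to the original.
my $latin1   = "\xE9";
my $upgraded = $latin1;
utf8::upgrade($upgraded);

print "bytes flagged:    ", (Encode::is_utf8($bytes)    ? "yes" : "no"), "\n";
print "chars flagged:    ", (Encode::is_utf8($chars)    ? "yes" : "no"), "\n";
print "upgraded flagged: ", (Encode::is_utf8($upgraded) ? "yes" : "no"), "\n";
print "upgraded eq original: ", ($upgraded eq $latin1 ? "yes" : "no"), "\n";
```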
