How to detect a decoded string
I'm chasing a bug in Perl code that seems to fundamentally be a version of this:
"Cannot decode string with wide characters" appears on a weird place
Basically, under certain conditions, Encode::decode('utf8', $string) is getting called twice on the same string, and hilarity ensues. Now, the best solution is to figure out what conditions are causing the double decode and stop that from happening. Unfortunately, this is mature production code for a feature-rich product; figuring out those conditions and fixing them in a way that doesn't introduce regressions looks to be challenging.
Is there some fast reliable way to detect whether a string has already been decoded from utf8? Inserting "if" statements before those calls feels a tad kludgy, but ought to be a pretty safe fix.
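The failure described above can be reproduced in a few lines. This is only a minimal sketch: the variable names and the sample character (U+263A) are chosen for illustration and do not come from the production code in question.

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 octets for U+263A (WHITE SMILING FACE); illustrative sample data.
my $bytes = "\xE2\x98\xBA";

# First decode: three octets become one character.
my $chars = decode('UTF-8', $bytes);

# Second decode: the input now holds a code point above 255, which cannot
# be an octet, so Encode dies instead of returning a string.
my $twice = eval { decode('UTF-8', $chars) };
my $error = $@;

print "first decode length: ", length($chars), "\n";
print "second decode failed: $error" unless defined $twice;
```

Note that the error only appears when the decoded string contains a character above 255. A decoded string whose characters all fit in one byte (for example "é") is silently mangled by a second decode rather than rejected.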
Comments (2)
It's impossible to correctly detect whether a scalar contains a decoded string or not. There's no way to communicate that info to Perl, so there's no way for it to communicate it to you. At best, one can guess. There are some heuristics you could use. From most reliable to least:
If the string contains characters above 255, it's not encoded. This is exactly what causes the "wide character" warning/error.
If the data would be UTF-8 when encoded and the scalar contains valid UTF-8, it's probably still encoded.
If the data would be UTF-8 when encoded and the scalar does not contain valid UTF-8, it's probably already decoded.
If the scalar's UTF8 flag is on, then the string is probably decoded.
If the scalar's UTF8 flag is off, then the string is probably not decoded.
You should decode all your inputs and encode all your outputs.
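The first three heuristics above could be combined into a guard along the following lines. This is a sketch, not a reliable detector: `decode_utf8_once` is a hypothetical helper name, and it assumes the encoded form of the data is always UTF-8. It still guesses wrong for a decoded string that happens to look like valid UTF-8 octets (for example the two characters "Ã©").

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: returns a character string, guessing whether the
# argument has already been decoded. Assumes the encoded form is UTF-8.
sub decode_utf8_once {
    my ($s) = @_;
    return $s unless defined $s;

    # Heuristic 1: a code point above 255 cannot be an octet,
    # so the string cannot be encoded data.
    return $s if $s =~ /[^\x00-\xFF]/;

    # Heuristics 2 and 3: attempt a strict decode. If the content is not
    # valid UTF-8, assume it was already decoded and return it unchanged.
    my $decoded = eval { decode('UTF-8', $s, FB_CROAK | LEAVE_SRC) };
    return defined $decoded ? $decoded : $s;
}

# Calling the helper twice on the same data is harmless for this input.
my $once  = decode_utf8_once("\xE2\x98\xBA");
my $again = decode_utf8_once($once);
print "same result\n" if $once eq $again;
```

Swapping such a helper in for the existing decode calls keeps the change local, at the cost of the misclassification noted above.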
Encode has an is_utf8 function:
Note that the documentation section covering it is titled "Messing with Perl's Internals", so this function might change in future Perl versions.
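A short sketch of how `is_utf8` behaves, with illustrative variable names. It also shows why the flag is only a hint: `utf8::upgrade` turns the flag on without changing what the string means.

```perl
use strict;
use warnings;
use Encode qw(decode is_utf8);

my $bytes = "\xE2\x98\xBA";            # raw UTF-8 octets
my $chars = decode('UTF-8', $bytes);   # decoded character string

my $flag_bytes = is_utf8($bytes) ? 1 : 0;   # flag is off for the octets
my $flag_chars = is_utf8($chars) ? 1 : 0;   # flag is on after decoding

# The flag describes internal storage, not whether decode() was called:
# upgrading a one-byte string turns the flag on but leaves the value equal.
my $latin1   = "\xE9";
my $upgraded = $latin1;
utf8::upgrade($upgraded);
my $flag_upgraded = is_utf8($upgraded) ? 1 : 0;

print "bytes: $flag_bytes, chars: $flag_chars, upgraded: $flag_upgraded\n";
```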