What problems can arise when migrating legacy Perl code to UTF-8?

Posted on 2024-08-12 05:48:26

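So far, the project I work on has used only ASCII in its source code. Because of some upcoming changes in the I18N area, and because we need some Unicode strings in our tests, we are thinking about biting the bullet and moving the source code to UTF-8, together with the utf8 pragma (use utf8;).

Since the code is ASCII-only right now, I don't expect any trouble from the code itself. However, I'm not sure what side effects we might run into, and given our environment (perl5.8.8, Apache2, mod_perl, MSSQL Server with the FreeTDS driver) I think some are quite likely.

If you have done such a migration in the past: what problems should I expect, and how can I manage them?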


葬﹪忆之殇 2024-08-19 05:48:26

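The utf8 pragma merely tells Perl that your source code is encoded as UTF-8. If you have only used ASCII in your source, Perl won't have any problem understanding it. You might want to make a branch in source control just to be safe. :)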

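If you need to deal with UTF-8 data in files, or write UTF-8 to files, you'll need to set the encodings on your filehandles and encode your data the way the external side expects it. See, for instance, "With a utf8-encoded Perl script, can it open a filename encoded as GB2312?".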
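To make the filehandle point concrete, here is a minimal sketch, assuming a throwaway temp file and a sample string that are not from the original post. It writes a character string through an :encoding(UTF-8) layer, reads it back through the same layer, and then reads the file raw to show that the byte count on disk differs from the character count.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Sample character string (illustrative only), written with escapes so the
# snippet does not depend on how this file itself is encoded.
my $text = "Gr\x{fc}\x{df}e, \x{4e16}\x{754c}";

# Create a scratch file; the path is only used for this demonstration.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Write: the :encoding(UTF-8) layer turns characters into bytes on output.
open(my $out, '>:encoding(UTF-8)', $path) or die "Cannot write $path: $!";
print {$out} $text;
close $out;

# Read back through the same layer: bytes are decoded into characters.
open(my $in, '<:encoding(UTF-8)', $path) or die "Cannot read $path: $!";
my $round_trip = do { local $/; <$in> };
close $in;

# Read raw to see how many bytes actually landed on disk.
open(my $raw, '<:raw', $path) or die "Cannot read $path: $!";
my $bytes = do { local $/; <$raw> };
close $raw;

printf "characters: %d, bytes on disk: %d\n",
    length($round_trip), length($bytes);
```

The same idea applies to any other boundary (sockets, STDOUT, database handles): decide which side is bytes and which side is characters, and put the conversion in exactly one place.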

Check the Perl documentation for its coverage of Unicode.

Also see Juerd's Perl Unicode Advice.


野鹿林 2024-08-19 05:48:26

A few years ago I moved our in-house mod_perl platform (~35k LOC) to UTF-8. Here are the things which we had to consider/change:

despite the perl doc advice of 'only where necessary', go for using 'use utf8;' in every source file - it gives you consistency.
convert your database to UTF-8 and ensure your DB config sets the connection charset to UTF-8 (in MySQL, watch out for field length issues with VARCHARs when doing this)
use a recent version of DBI - older versions don't correctly set the utf8 flag on returned scalars
use the Encode module, avoid using perl's built in utf8 functions unless you know exactly what data you're dealing with
when reading UTF-8 files, specify the layer - open($fh,"<:utf8",$filename)
on a RedHat-style OS (even 2008 releases) the included libraries won't like reading XML files stored in utf8 scalars - upgrade perl or just use the :raw layer
in older perls (even 5.8.x versions) some older string functions can be unpredictable - eg. $b=substr(lc($utf8string),0,2048) fails randomly but $a=lc($utf8string);$b=substr($a,0,2048) works!
remember to convert your input - eg. in a web app, incoming form data may need decoding
ensure all dev staff know which way around the terms encode/decode are - a 'utf8 string' in perl is in /de/-coded form, a raw byte string containing utf8 data is /en/-coded
handle your URLs properly - /en/-code a utf8 string into bytes and then do the %xx encoding to produce the ASCII form of the URL, and /de/-code it when reading it from mod_perl (eg. $uri=decode('UTF-8',$r->uri()), with decode imported from the Encode module)
one more for web apps, remember the charset in the HTTP header overrides the charset specified with <meta>
I'm sure this one goes without saying - if you do any byte operations (eg. packet data, bitwise operations, even a MIME Content-Length header) make sure you're calculating with bytes and not chars
make sure your developers know how to ensure their text editors are set to UTF-8 even if there's no BOM on a given file
remember to ensure your revision control system (for google's benefit - subversion/svn) will correctly handle the files
where possible, stick to ASCII for filenames and variable names - this avoids portability issues when moving code around or using different dev tools
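A few of the points above (encode vs. decode direction, URL handling, byte counts for Content-Length) can be sketched with the Encode module. This is only an illustration under assumptions: url_escape and url_unescape are hypothetical helper names, not part of mod_perl or Encode, and the sample string is made up.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: turn a character string into the ASCII form of a URL part.
sub url_escape {
    my ($chars) = @_;
    my $bytes = encode('UTF-8', $chars);    # /en/-code: characters -> bytes
    # Percent-encode every byte outside the unreserved set.
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $bytes;
}

# Hypothetical helper: the reverse direction, for incoming request data.
sub url_unescape {
    my ($escaped) = @_;
    (my $bytes = $escaped) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return decode('UTF-8', $bytes);          # /de/-code: bytes -> characters
}

my $name = "caf\x{e9}";                      # character string, 4 characters
my $body = encode('UTF-8', $name);           # byte string, 5 bytes

my $char_count     = length($name);          # counts characters
my $content_length = length($body);          # counts bytes: use this in headers

print url_escape($name), "\n";
print "chars=$char_count bytes=$content_length\n";
```

The design choice here is to keep character strings everywhere inside the program and convert only at the edges, so length() on an encoded body is always a byte count.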

One more - this is the golden rule - don't just hack until it works, make sure you fully understand what's happening in a given en/decoding situation!
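One way to check what is actually happening, rather than guessing, is to dump what a scalar holds before and after a conversion. The helper below is a hypothetical debugging aid (describe_string is not a standard function); it only relies on utf8::is_utf8 and Encode's decode. The utf8 flag is an internal detail, so treat it as a diagnostic and not as something to branch on in real code.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical debugging helper: report the internal flag, the length as Perl
# sees it, and each element's code point.
sub describe_string {
    my ($s) = @_;
    my $flag  = utf8::is_utf8($s) ? 'on' : 'off';
    my $codes = join ' ', map { sprintf 'U+%04X', ord($_) } split //, $s;
    return sprintf 'utf8 flag %s, length %d: %s', $flag, length($s), $codes;
}

my $bytes = "\xC3\xA9";                  # UTF-8 encoding of e-acute, still raw bytes
my $chars = decode('UTF-8', $bytes);     # decoded: a single character

print describe_string($bytes), "\n";     # two elements, one per byte
print describe_string($chars), "\n";     # one element, the character itself
```

If a value that should be a single character shows up as two elements, it has not been decoded yet; if an already decoded value is decoded again, that is the usual source of double-encoding bugs.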

I'm sure you already had most of these sorted out but hopefully all that helps someone out there avoid the many hours debugging which we went through.
