Encoding problem in Perl
I have an encoding question and would like to ask for help. I notice that if I choose "UTF-8" as the encoding, there are (at least) two kinds of double quote: " and “. But when I choose "ISO-8859-1" as the encoding, the latter double quote becomes ¡°, or sometimes, for example, “.
Could anyone please explain why this is the case? How can I match “ and replace it with " using a regexp in Perl?
Thanks a lot.
Comments (2)
ISO-8859-1 is a one-byte-per-character encoding. The fancy Unicode double quotes are not in the ISO-8859-1 character set. So what you are seeing is a single multi-byte character whose bytes are each being displayed as a separate ISO-8859-1 character.
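To make that concrete, here is a small sketch that uses the core Encode module to show how one curly quote turns into three characters. It assumes the text is stored as UTF-8 and that the viewer is really using Windows-1252 (which many programs substitute for ISO-8859-1); the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+201C LEFT DOUBLE QUOTATION MARK as a one-character Perl string.
my $quote = "\x{201C}";

# Encoding it as UTF-8 produces three bytes.
my $bytes = encode('UTF-8', $quote);
my $hex   = join ' ', map { sprintf('%02X', $_) } unpack('C*', $bytes);

# Misreading those bytes with a one-byte-per-character encoding
# yields three separate characters instead of one quote.
my $as_cp1252 = decode('cp1252',     $bytes);
my $as_latin1 = decode('ISO-8859-1', $bytes);

binmode STDOUT, ':encoding(UTF-8)';
print "UTF-8 bytes:      $hex\n";
print "chars as cp1252:  ", length($as_cp1252), "\n";
print "chars as Latin-1: ", length($as_latin1), "\n";
print "shown as cp1252:  $as_cp1252\n";
```

In strict ISO-8859-1 the second and third bytes map to invisible control characters, which may be why the garbage looks different from one program to another.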
To match these weird things, see the perlunicode man page, especially the \x{...} and \N{...} escape sequences.
To answer your question, try \x{201C} to match the Unicode LEFT DOUBLE QUOTATION MARK and \x{201D} to match the RIGHT DOUBLE QUOTATION MARK. You missed the latter in your question :-).
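A minimal sketch of that substitution, assuming the input arrives as UTF-8 encoded bytes and that both curly quotes should become the plain ASCII quotation mark. The helper name straighten_quotes is hypothetical, not something from Perl itself.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes UTF-8 encoded bytes and returns UTF-8
# encoded bytes with curly double quotes replaced by ASCII ".
sub straighten_quotes {
    my ($utf8_bytes) = @_;

    # Decode first, so the regexp sees characters rather than raw bytes.
    my $text = decode('UTF-8', $utf8_bytes);

    # U+201C is the left curly quote, U+201D the right one.
    $text =~ s/[\x{201C}\x{201D}]/"/g;

    return encode('UTF-8', $text);
}

# The same input a UTF-8 file would contain: curly quotes around Hello.
my $input  = "\xE2\x80\x9CHello\xE2\x80\x9D";
my $output = straighten_quotes($input);
print "$output\n";
```

The decode step is the important part: without it the pattern is compared against individual bytes and \x{201C} never matches.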
[update]
I should have provided my reference... Some nice gentleman in the UK has a page on ASCII and Unicode quotation marks. The plain vanilla ASCII/ISO-8859-1 double-quote is just called QUOTATION MARK.
Maybe this old post will help.