How do I represent Unicode characters in an ASCII regex pattern?
RegEx flavor: wxRegEx in C++.
One of the strings that I need to match contains characters like '…' (U+2026, Horizontal Ellipsis), which turns into \205 when pasted into Emacs, and '»' (U+00BB, Right-Pointing Double Angle Quotation Mark), which remains » when pasted into Emacs (ASCII source code mode).
In the regex pattern itself I tried representing '…' as both \205 and \\205 to no avail.
What is the right way of approaching this problem?
Update: The wxRegEx documentation states that \uwxyz (where wxyz is exactly four hexadecimal digits) represents the Unicode character U+wxyz in the local byte ordering.
I tried that, but for some reason it doesn't work for me (yet).
It depends on the language. In many languages there’s no need to escape non-ASCII characters at all, but you may have to tell the compiler what encoding the source is in, for example by passing -encoding UTF-8 to javac. With things like Perl, Python, and Ruby, you can instead put the declaration inside the file itself, provided the encoding is upward compatible with ASCII. In Perl, for instance, that declaration is the use utf8 pragma.
That’s the easiest way to do it, and I highly recommend it: just put the real UTF-8 characters in your source code. If you have to figure out how to escape things, well, it’s far less convenient.
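If you are going to use escapes, how you specify non-ASCII characters symbolically also varies by language. In Java you can use \uXXXX, which the rather unpleasant Java preprocessor translates before the compiler parses anything else, although I do not recommend that way. If the character is going to be used in a pattern, you can instead delay interpretation by doubling the backslash (\\uXXXX in the string literal), so that the regex engine rather than the compiler handles the escape. That is cleaner and messier at the same time.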
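A minimal sketch of the two Java mechanisms, for illustration only. The class, field, and method names are hypothetical and not from the original answer; the only library calls assumed are Pattern.compile and Matcher.find from java.util.regex.

```java
import java.util.regex.Pattern;

class UnicodeEscapeDemo {
    // Mechanism 1: single backslash. The compiler replaces the escape with
    // the real character, so the regex engine receives a literal ellipsis.
    static final String EARLY = "\u2026";

    // Mechanism 2: doubled backslash. The string holds six plain ASCII
    // characters, and java.util.regex interprets the escape itself.
    static final String DELAYED = "\\u2026";

    // Returns true if the given regex matches anywhere in the input.
    static boolean containsMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Built from code points so the demo does not depend on source encoding.
        String input = "Loading" + (char) 0x2026 + " " + (char) 0x00BB;

        System.out.println(EARLY.length());   // 1
        System.out.println(DELAYED.length()); // 6
        System.out.println(containsMatch(EARLY, input));   // true
        System.out.println(containsMatch(DELAYED, input)); // true
    }
}
```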
That second mechanism spares you from trying to figure out what the string looks like after the Java preprocessor has had its way with it (you can’t use \u0022 in a string literal, but you can use \\u0022), but then it screws up your Pattern.CANON_EQ flag.

Most other languages have a more straightforward way to do it than Java, which also insists on ugly UTF-16 unless you compile your source with javac -encoding UTF-8. Hardcoding UTF-16 surrogates is absolutely idiotic. Do not do it!

In Perl you could use a numeric escape such as \x{2026}, but you can also name characters symbolically, as in \N{HORIZONTAL ELLIPSIS} or \N{RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK}. The last one can be made much shorter if you’d prefer, for instance with a custom alias defined through the charnames pragma.
All of those are just about infinitely superior to hardcoding magic numbers into your code.
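For Java, one way to approximate symbolic names is to keep each escape in a named constant, so the code point appears only once. This is a sketch under stated assumptions, not a counterpart of the Perl feature: the class and constant names are hypothetical, and Character.getName (available since Java 7) is used only to show the official Unicode name of each code point.

```java
import java.util.regex.Pattern;

final class Glyphs {
    // Regex fragments named after the characters they match. The doubled
    // backslash leaves the escape for the regex engine to interpret.
    static final String HORIZONTAL_ELLIPSIS = "\\u2026";
    static final String RIGHT_DOUBLE_ANGLE_QUOTE = "\\u00BB";

    // Character class matching either of the two characters.
    static final Pattern EITHER =
        Pattern.compile("[" + HORIZONTAL_ELLIPSIS + RIGHT_DOUBLE_ANGLE_QUOTE + "]");

    // Looks up the official Unicode name of a code point.
    static String officialName(int codePoint) {
        return Character.getName(codePoint);
    }

    public static void main(String[] args) {
        String input = "more" + (char) 0x2026;
        System.out.println(EITHER.matcher(input).find()); // true
        System.out.println(officialName(0x2026)); // HORIZONTAL ELLIPSIS
        System.out.println(officialName(0x00BB)); // RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
    }
}
```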
This all assumes your language supports Unicode, but many do not.