How do I represent Unicode characters in an ASCII regex pattern?

Posted on 2024-10-13 02:53:28

RegEx flavor: wxRegEx in C++.

One of the strings that I need to match contains characters like '…' (U+2026, Horizontal Ellipsis), which turns into \205 when pasted into Emacs, and '»' (U+00BB, Right-Pointing Double Angle Quotation Mark), which remains » when pasted into Emacs (ASCII source code mode).

In the regex pattern itself I tried representing '…' as both \205 and \\205, to no avail.

What is the right way of approaching this problem?

Update: The wxRegEx documentation states that to represent a Unicode character you can use \uwxyz (where wxyz is exactly four hexadecimal digits), which stands for the Unicode character U+wxyz in the local byte ordering.

I tried that, but for some reason it doesn't work for me (yet).

Comments (1)

雅心素梦 2024-10-20 02:53:28

It depends on the language. In many languages there’s no need to escape non-ASCII, but you may have to tell the compiler what encoding the source is in. For example:

$ javac -encoding UTF-8 SomeThing.java

or

$ perl -Mutf8 somescript

Although with things like Perl, Python, and Ruby, you can put the declaration inside the file, providing it’s upwards compatible with ASCII. For example:

#!/usr/bin/perl

use utf8;
use strict;
use warnings;
use autodie;

my $s = "Où se trouve mon élève?";

if ($s =~ /élève/) { ... }

# although of course this also works fine:

while ($s =~ /\b(\w+)\b/g) {
    print "Found <$1>\n";
}

That’s the easiest way to do it, and I highly recommend it: just put the real UTF-8 characters in your source code. If you have to figure out how to escape things, well, it’s far less convenient.

If you are going to use escapes, how you specify non-ASCII characters symbolically also varies by language. In Java you can go through the obnoxious Java preprocessor via \uXXXX:

String s = "e\u0301le\u0300ve";

although I do not recommend that way. If it’s going to be used in a pattern, you can delay interpolation, which is cleaner and messier at the same time:

String s = "e\\u0301le\\u0300ve";

That second mechanism spares you from trying to figure out what the string looks like after the Java preprocessor has had its way with it (you can’t use \u0022, but you can use \\u0022), but then it screws up your Pattern.CANON_EQ flag.
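As a rough Java sketch of those two escaping levels (the class and method names are made up for illustration, and the sample text is an assumption built from the characters in the question): with a single backslash the compiler substitutes the character before the regex engine ever sees it, whereas with a doubled backslash the six characters of the escape reach java.util.regex.Pattern, which decodes them itself. Either way the pattern should find the ellipsis.

```java
import java.util.regex.Pattern;

// Illustrative only: EscapeLevels and finds() are hypothetical names.
class EscapeLevels {
    // Returns true if the regex occurs anywhere in the text.
    static boolean finds(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // Sample text containing U+2026 and U+00BB, written with escapes
        // so that this source file stays pure ASCII.
        String text = "Loading\u2026 \u00BB next";

        // Compiler-level escape: the string holds a single character.
        String compilerLevel = "\u2026";
        // Regex-level escape: the string holds a backslash, 'u', and four hex digits.
        String regexLevel = "\\u2026";

        System.out.println(compilerLevel.length());      // 1
        System.out.println(regexLevel.length());         // 6
        System.out.println(finds(compilerLevel, text));  // true
        System.out.println(finds(regexLevel, text));     // true
    }
}
```

The doubled form has the side benefit that characters such as a double quote can be escaped without the compiler turning them into a premature end of the string literal.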

Most other languages have a more straightforward way to do it than Java, which also insists on ugly UTF-16 unless you compile your source with javac -encoding UTF-8. Hardcoding UTF-16 surrogates is absolutely idiotic. Do not do it!
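To make the surrogate point concrete, here is a small Java sketch (class and method names are hypothetical, and the choice of U+1D11E as the sample character is mine, not the answer's). It builds a character above U+FFFF from its code point and matches it with the braces form of the hex escape, which java.util.regex.Pattern has accepted since Java 7, so no surrogate value appears anywhere in the code.

```java
import java.util.regex.Pattern;

// Illustrative only: CodePoints and fromCodePoint() are hypothetical names.
class CodePoints {
    // Builds a string from one code point; Character.toChars produces
    // the surrogate pair when the code point is above U+FFFF.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String clef = fromCodePoint(0x1D11E); // MUSICAL SYMBOL G CLEF

        // Two UTF-16 code units, but a single code point.
        System.out.println(clef.length());                         // 2
        System.out.println(clef.codePointCount(0, clef.length())); // 1

        // The pattern names the code point directly, not its surrogates.
        System.out.println(Pattern.matches("\\x{1D11E}", clef));   // true
    }
}
```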

In Perl you could use:

my $s_nfd = "e\x{301}le\x{300}ve";  # NFD form
my $s_nfc = "\xE9l\xE8ve";          # NFC form

but you can also name them symbolically

use charnames qw< :full >;
my $s_as_NFD = "e\N{COMBINING ACUTE ACCENT}le\N{COMBINING GRAVE ACCENT}ve";
my $s_as_NFC = "\N{LATIN SMALL LETTER E WITH ACUTE}l\N{LATIN SMALL LETTER E WITH GRAVE}ve";

The last one can be made much shorter if you’d prefer:

use charnames qw< :full latin >;
my $s_as_NFC = "\N{e WITH ACUTE}l\N{e WITH GRAVE}ve";

All of those are just about infinitely superior to hardcoding magic numbers into your code.
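For comparison, a similar naming approach can be sketched in Java, assuming Java 9 or later, where Character.codePointOf looks up a character by its Unicode name and Pattern understands the \N{...} escape. The class, method, and constant names below are hypothetical, and the two characters are the ones from the question.

```java
import java.util.regex.Pattern;

// Illustrative only: NamedChars and named() are hypothetical names.
// Assumes Java 9 or later.
class NamedChars {
    // Looks up a character by its Unicode name and returns it as a string.
    static String named(String unicodeName) {
        return new String(Character.toChars(Character.codePointOf(unicodeName)));
    }

    static final String ELLIPSIS = named("HORIZONTAL ELLIPSIS");
    static final String RIGHT_GUILLEMET =
            named("RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK");

    public static void main(String[] args) {
        String text = "More" + ELLIPSIS + " " + RIGHT_GUILLEMET;

        // The same names can be used inside the pattern itself.
        Pattern p = Pattern.compile(
                "\\N{HORIZONTAL ELLIPSIS}\\s*"
                + "\\N{RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK}");

        System.out.println(p.matcher(text).find());
        // Going the other way, from code point to name.
        System.out.println(Character.getName(0x2026)); // HORIZONTAL ELLIPSIS
    }
}
```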

This all assumes your language supports Unicode, but many do not.
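One more thing worth checking when you write such strings with escapes: the NFD and NFC spellings of the same word are different code point sequences, so a plain comparison treats them as unequal. A minimal Java sketch, using the standard java.text.Normalizer (the class and method names are hypothetical):

```java
import java.text.Normalizer;

// Illustrative only: Normalization and sameAfterNfc() are hypothetical names.
class Normalization {
    // Compares two strings after bringing both to NFC.
    static boolean sameAfterNfc(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String nfd = "e\u0301le\u0300ve"; // base letters plus combining accents
        String nfc = "\u00E9l\u00E8ve";   // precomposed letters

        System.out.println(nfd.length());           // 7
        System.out.println(nfc.length());           // 5
        System.out.println(nfd.equals(nfc));        // false
        System.out.println(sameAfterNfc(nfd, nfc)); // true
    }
}
```

Normalizing both the pattern source and the input to the same form before matching is one way to sidestep the mismatch; whether that suits a given regex engine is something to verify against its documentation.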
