How does the .NET regular expression engine handle mixed RTL+LTR strings?

Posted 2024-12-11 01:32:10

I have a mixed Hebrew/English string to parse.
The string is built like this:

[3 hebrew] [2 english 2] [1 hebrew],

So it reads as 1 2 3 but is stored as 3 2 1 (that is the exact byte sequence in the file, double-checked in a hex editor; RTL is only a display attribute anyway). The .NET regex parser has a RightToLeft option which, when applied to plain LTR text, starts processing from the right-hand end of the string.

When should this option be applied in order to extract the [3 hebrew] and [2 english] parts from the string, or to check whether [1 hebrew] matches the end of the string? Are there any hidden specifics, or is there nothing to worry about (as when processing any LTR string that contains special Unicode characters)?

Also, can anyone recommend a good RTL+LTR text editor? I am afraid that VS Express sometimes displays the text wrongly and might even start corrupting the saved strings, and I would like to re-check the files without resorting to a hex editor.

寻找一个思念的角度 2024-12-18 01:32:10

The RightToLeft option refers to the order in which the regular expression moves through the character sequence, and should really be called LastToFirst: for Hebrew and Arabic text that direction is actually left to right on screen, and with mixed RTL and LTR text such as you describe, the expression "right to left" is even less appropriate.
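As a minimal sketch of that last-to-first behaviour on plain LTR input (the sample string and variable names are invented for illustration), the same pattern can be run in both directions and the order of the returned matches compared:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical sample input: plain LTR text with three digits.
string input = "a1b2c3";

// Default direction: matches are reported first-to-last.
string forward = string.Join(",",
    Regex.Matches(input, @"\d").Cast<Match>().Select(m => m.Value));

// RightToLeft: scanning starts at the end of the string,
// so the matches are reported last-to-first.
string backward = string.Join(",",
    Regex.Matches(input, @"\d", RegexOptions.RightToLeft).Cast<Match>().Select(m => m.Value));

Console.WriteLine(forward);  // 1,2,3
Console.WriteLine(backward); // 3,2,1
```

Nothing here depends on the script of the text; the direction is purely a matter of position in the stored string.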

The option has a minor effect on speed (it only matters if the searched text is massive) and changes the behaviour of regular expressions run with a startAt index: the search covers the part of the string before startAt rather than the part after it.

Examples; let's hope the browsers don't mess this up too much:

using System;
using System.Text.RegularExpressions;

// A saying whichever way round the browser displays malkuth and kether.
string saying = "למכות is in כתר";
string kether = "כתר";
Console.WriteLine(new Regex(kether, RegexOptions.RightToLeft).IsMatch(saying)); // True
Console.WriteLine(new Regex(kether, RegexOptions.None).IsMatch(saying)); // True; perhaps minutely faster, but noise would hide the difference
// With startAt = 2, RightToLeft only searches the text before index 2:
Console.WriteLine(new Regex(kether, RegexOptions.RightToLeft).IsMatch(saying, 2)); // False
Console.WriteLine(new Regex(kether, RegexOptions.None).IsMatch(saying, 2)); // True
// And to show that the ordering is stored character order rather than physical display order:
Console.WriteLine(new Regex("" + kether[0] + ".*" + kether[2]).IsMatch(saying)); // True
Console.WriteLine(new Regex("" + kether[2] + ".*" + kether[0]).IsMatch(saying)); // False
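For the extraction asked about in the question, a possible sketch is to match the runs in storage order with the default direction. The pattern, the variable names and the use of the `\p{IsHebrew}` block (U+0590 to U+05FF) are my own choices, not something the question specifies, and the sample text is the saying from the example, written with escapes so that no editor or browser can reorder it:

```csharp
using System;
using System.Text.RegularExpressions;

// Stored order: five Hebrew letters, " is in ", three Hebrew letters.
string text = "\u05DC\u05DE\u05DB\u05D5\u05EA is in \u05DB\u05EA\u05E8";

// Hebrew run stored first, then the Latin part, then the Hebrew run stored last.
// This pattern is an illustrative assumption about the input's shape.
Match m = Regex.Match(text, @"^(\p{IsHebrew}+)\s+([A-Za-z ]+?)\s+(\p{IsHebrew}+)$");

string first = m.Groups[1].Value;   // Hebrew run at the start of the stored string
string middle = m.Groups[2].Value;  // "is in"
string last = m.Groups[3].Value;    // Hebrew run at the end of the stored string

// An end-of-string check gives the same answer in either direction;
// RightToLeft only changes where the engine starts looking.
bool endsLtr = Regex.IsMatch(text, @"\p{IsHebrew}+\z");
bool endsRtl = Regex.IsMatch(text, @"\p{IsHebrew}+\z", RegexOptions.RightToLeft);

Console.WriteLine(m.Success);
Console.WriteLine(middle);
Console.WriteLine(first.Length + " " + last.Length);
Console.WriteLine(endsLtr + " " + endsRtl);
```

Under these assumptions RightToLeft is never needed for correctness; it would mainly be worth considering when the interesting part sits near the end of a very long string.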
