Lua pattern matching around a comma
I have several small place marks such as 'א,א' and 'א,ב'. Using the comma as the center point, I need at most 2 characters before the comma, and everything after it up to the next space.
I have `(.-,.-)%s` but it's not doing what I need. Any ideas?
Also, as you can see, these are not Latin letters, so using `%l` will not work.
Comments (1)
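There are a couple of issues here. First, a minor one: `.-,` matches as few characters as possible before the comma, which can be zero, and it puts no upper limit on them either. An unanchored match starts at the earliest possible position, so the capture can also take in everything from the start of the string up to the comma. You should anchor the beginning of the match.

The more complicated issue is that you are using Hebrew letters. Lua has no concept of multi-byte characters; its patterns work on single bytes.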
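To illustrate the anchoring point, here is a minimal sketch. The sample string is made up, the letters are written as UTF-8 byte escapes, and the default C locale is assumed. It compares the original pattern with one anchored on the preceding whitespace; it does not yet enforce the two-character limit.

```lua
-- Hypothetical sample: Latin text, then a mark (alef, comma, bet) in UTF-8 bytes.
local sample = "foo \215\144,\215\145 bar"

-- Unanchored: the match starts at byte 1, so ".-" grows until it reaches the comma.
local loose = sample:match("(.-,.-)%s")

-- Anchored on the preceding whitespace: only non-space bytes may precede the comma.
local tight = sample:match("%s(%S-,%S-)%s")

print(loose)  -- "foo " followed by the whole mark
print(tight)  -- the mark only
```

Note that the anchored form needs a space before the mark, so a mark at the very start of the string would have to be handled separately (for example by padding the string with a space).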
If you use an 8-bit encoding such as Windows-1255 or ISO-8859-8, then you can probably just match against the character class `[א-ת]` (alef through tav), provided the pattern itself is stored in that same encoding. If you have a Hebrew locale set properly, the locale-dependent classes such as `%a` should work for you as well (`%l` matches only what the locale classifies as lowercase, which may not include Hebrew letters).

If you use UTF-8 or any other multi-byte encoding, then you must construct a pattern in which the Hebrew letters are spelled out as sequences of octets. Alef is U+05D0, which UTF-8 represents as `0xD7 0x90`. Tav is U+05EA, which is encoded as `0xD7 0xAA`.

In Lua you can escape any byte with a backslash followed by its decimal code. All the Hebrew letters from alef to tav share the same first byte in UTF-8, `0xD7`, that is `"\215"`. The second byte can be anything from `"\144"` to `"\170"`. Thus, the pattern that matches a single Hebrew letter is `"\215[\144-\170]"`. Put that into your original pattern wherever you had a single dot matching any character. Keep in mind that a quantifier such as `-` or `?` applies only to the item directly before it (here, the bracket class), not to the whole two-byte sequence.

Of course, the above reasoning must be adjusted for encodings other than UTF-8. The right-to-left writing direction of Hebrew is another thing to keep in mind.
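As a rough sketch of how the byte-level pattern could be put to use, assuming UTF-8 input and marks that are delimited by whitespace: the helper name `marks` and the sample text below are hypothetical, not something from the question. Because a Lua quantifier cannot apply to a two-byte sequence, the "one or two letters before the comma" rule is checked with two separate patterns.

```lua
-- One Hebrew letter (alef U+05D0 .. tav U+05EA) in UTF-8:
-- lead byte 0xD7 (215), then a byte in 0x90..0xAA (144..170).
local HEB = "\215[\144-\170]"

-- Hypothetical helper: collects whitespace-delimited tokens that start
-- with one or two Hebrew letters followed by a comma.
local function marks(text)
  local found = {}
  for token in text:gmatch("%S+") do
    if token:match("^" .. HEB .. ",")
        or token:match("^" .. HEB .. HEB .. ",") then
      found[#found + 1] = token
    end
  end
  return found
end

-- Usage with made-up text: two valid marks, one Latin token,
-- and one token with three letters before the comma (rejected).
local text = "see \215\144,\215\145 and \215\146\215\147,\215\148 "
          .. "or x,y \215\144\215\145\215\146,\215\147 end"
for _, mark in ipairs(marks(text)) do
  print(mark)
end
```

Splitting on whitespace first keeps each pattern short and gives the "up to the next space" part for free; if marks can be attached to punctuation, the token pattern would need adjusting.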