Matching Unicode in ply regular expressions

Posted on 2024-07-07 13:17:46

I'm matching identifiers, but now I have a problem: my identifiers are allowed to contain Unicode characters, so the old approach is no longer enough:

t_IDENTIFIER = r"[A-Za-z](\\.|[A-Za-z_0-9])*"

In my markup language parser I match Unicode characters by allowing every character except the ones I explicitly use, because my markup language only has two or three characters that I need to escape that way.

How do I match all Unicode characters with Python regexes and ply? And is this a good idea at all?

I want to let people use things like Ω » « ° foo² väli π as identifiers (variable names and such) in their programs. Heck, I want people to be able to write programs in their own language if that's practical! Anyway, Unicode is supported in a wide variety of places nowadays, and it should spread.

Edit: POSIX character classes don't seem to be recognised by Python regexes.

>>> import re
>>> item = re.compile(r'[[:word:]]')
>>> print(item.match('e'))
None

Edit: To explain better what I need: I need a regex that matches all printable Unicode characters but no ASCII characters at all.

Edit: r"\w" does some of what I want, but it does not match « », and I also need a regex that does not match digits.

荒路情人 2024-07-14 13:17:46

The re module supports the \w syntax, which the documentation describes as follows:

If UNICODE is set, this will match the
characters [0-9_] plus whatever is
classified as alphanumeric in the
Unicode character properties database.

Therefore, the following example shows how to match Unicode identifiers:

>>> import re
>>> m = re.compile(r'(?u)[^\W0-9]\w*')
>>> m.match('a')
<_sre.SRE_Match object at 0xb7d75410>
>>> m.match('9')
>>> m.match('ab')
<_sre.SRE_Match object at 0xb7c258e0>
>>> m.match('a9')
<_sre.SRE_Match object at 0xb7d75410>
>>> m.match('unicöde')
<_sre.SRE_Match object at 0xb7c258e0>
>>> m.match('ödipus')
<_sre.SRE_Match object at 0xb7d75410>

So the expression you are looking for is: (?u)[^\W0-9]\w*
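
A minimal sketch of how that expression behaves on the identifiers from the question, assuming Python 3 (the session above is Python 2). There, str patterns are Unicode-aware by default, so the (?u) prefix is redundant but harmless. The helper name is_identifier is made up for illustration.

```python
import re

# Same expression as in the answer; (?u) is kept only for parity with it.
IDENTIFIER = re.compile(r"(?u)[^\W0-9]\w*")


def is_identifier(text):
    """Return True if the whole string is a single identifier (hypothetical helper)."""
    return IDENTIFIER.fullmatch(text) is not None


for candidate in ["Ω", "väli", "π", "a9", "9a", "«", "°"]:
    print(candidate, is_identifier(candidate))
```

Ω, väli, π and a9 are accepted, while 9a, « and ° are rejected. The last two fail because they are punctuation and symbols rather than word characters, which is the limitation of \w mentioned in the question.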

半衾梦 2024-07-14 13:17:46

You need to pass the reflags parameter to lex.lex:

lex.lex(reflags=re.UNICODE)
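
As far as I understand it, ply hands reflags to re.compile when it builds its token regexes, so the effect of the flag can be tried with plain re, without ply installed. Below is a rough sketch, assuming Python 3 and an identifier pattern of the form [^\W0-9]\w*. Recent ply releases appear to default reflags to re.VERBOSE, so passing only re.UNICODE would replace that default. Combining the two flags, as below, keeps it.

```python
import re

pattern = r"[^\W0-9]\w*"

# Roughly what lex.lex(reflags=re.UNICODE | re.VERBOSE) would compile with.
unicode_aware = re.compile(pattern, re.UNICODE | re.VERBOSE)
# re.ASCII (Python 3 only) restricts \w to [A-Za-z0-9_], for comparison.
ascii_only = re.compile(pattern, re.ASCII | re.VERBOSE)

print(unicode_aware.match("ödipus"))  # a match object covering the whole word
print(ascii_only.match("ödipus"))     # None, because "ö" is not an ASCII word character
```

In Python 3 the Unicode behaviour is already the default for str patterns, so the flag mostly matters on Python 2 or when something else has switched the lexer to ASCII matching.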

暮年慕年 2024-07-14 13:17:46

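Check the answers to this question:

Stripping non printable characters from a string in python

You just need to use different Unicode character categories.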
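
A small sketch of what filtering by Unicode category could look like, using the standard unicodedata module. Which categories count as "not printable" is an assumption made here for illustration (everything in the C* and Z* groups), and the function name is hypothetical.

```python
import unicodedata

# Major categories treated as "not printable" in this sketch (an assumption):
# C* = control, format, surrogate, private use, unassigned; Z* = separators.
EXCLUDED_MAJOR_CATEGORIES = {"C", "Z"}


def is_printable_non_ascii(ch):
    """True for a non-ASCII character outside the excluded categories."""
    if ord(ch) <= 127:
        return False
    return unicodedata.category(ch)[0] not in EXCLUDED_MAJOR_CATEGORIES


for ch in "Ω«»°²πa9":
    print(repr(ch), unicodedata.category(ch), is_printable_non_ascii(ch))
```

Unlike \w, this accepts « » and °, since their categories (Pi, Pf and So) are not excluded, and it rejects every ASCII character, which is closer to what the question's second edit asks for.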

静水深流 2024-07-14 13:17:46

Solved it with the help of Vinko.

I realised that enumerating Unicode ranges is plain dumb, so I'll do this instead:

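import re

# ASCII punctuation: printable ASCII that is neither a letter nor a digit
symbols = re.escape(''.join([chr(i) for i in range(33, 127) if not chr(i).isalnum()]))
# ASCII punctuation plus digits, so an identifier cannot start with a digit
symnums = re.escape(''.join([chr(i) for i in range(33, 127) if not chr(i).isalpha()]))

t_IDENTIFIER = "[^%s](\\.|[^%s])*" % (symnums, symbols)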

I don't know about Unicode character classes. If this Unicode stuff starts getting too complicated, I can just put the original pattern back in place. UTF-8 support still ensures that Unicode works inside STRING tokens, which is more important.

Edit: On the other hand, I'm starting to understand why there isn't much Unicode support in programming languages. This is an ugly hack, not a satisfying solution.
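
For anyone wanting to try the negated-class approach outside ply, here is a rough standalone sketch. It assumes Python 3 and assumes that symnums is meant to cover ASCII symbols plus digits (filtering with isalpha() rather than isalnum()). The name identifier is just for illustration.

```python
import re

# ASCII punctuation (codes 33..126 minus letters and digits).
symbols = re.escape("".join(chr(i) for i in range(33, 127) if not chr(i).isalnum()))
# ASCII punctuation plus digits (assumed intent of "symnums").
symnums = re.escape("".join(chr(i) for i in range(33, 127) if not chr(i).isalpha()))

identifier = re.compile("[^%s](\\.|[^%s])*" % (symnums, symbols))

for candidate in ["väli", "«", "a9", "9a", "a+b", "foo bar"]:
    print(repr(candidate), identifier.fullmatch(candidate) is not None)
```

väli, « and a9 are accepted, and 9a and a+b are rejected. The last case is also accepted: because the range starts at 33, the space is not excluded, so whitespace would end up inside identifiers unless it is added to both classes.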
