Matching unicode in ply's regular expressions
I'm matching identifiers, but now I have a problem: my identifiers are allowed to contain unicode characters. Therefore the old way to do things is not enough:
t_IDENTIFIER = r"[A-Za-z](\\.|[A-Za-z_0-9])*"
In my markup language parser I match unicode characters by allowing all characters except those I explicitly use, because my markup language has only two or three characters that I need to escape that way.
How do I match all unicode characters with Python regexes and ply? Also, is this a good idea at all?
I want to let people use identifiers like Ω » « ° foo² väli π as identifiers (variable names and such) in their programs. Heck, I want people to be able to write programs in their own language if it's practical! Anyway, unicode is supported in a wide variety of places nowadays, and it should spread.
Edit: POSIX character classes don't seem to be recognised by Python regexes.
>>> import re
>>> item = re.compile(r'[[:word:]]')
>>> print item.match('e')
None
Edit: To explain better what I need: I need a regex that matches all printable unicode characters but no ASCII characters at all.
Edit: r"\w" does part of what I want, but it does not match « », and I also need a regex that does not match numbers.
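One way to read the requirement above is to keep the ASCII part of the original identifier rule and additionally accept any character outside the ASCII range. The following is only a minimal sketch of that reading, written for Python 3. The names NON_ASCII, SKETCH_IDENTIFIER and is_identifier are made up for illustration, the backslash-escape branch of the original rule is left out for brevity, and [^\x00-\x7F] also admits non-printable and whitespace code points above U+007F, which a real lexer would probably want to exclude.

```python
import re

# Any code point outside the 7-bit ASCII range (hypothetical building block).
NON_ASCII = r"[^\x00-\x7F]"

# First character: an ASCII letter or any non-ASCII character.
# Following characters: ASCII letters, digits, underscore, or any non-ASCII character.
SKETCH_IDENTIFIER = re.compile(
    r"(?:[A-Za-z]|%s)(?:[A-Za-z_0-9]|%s)*" % (NON_ASCII, NON_ASCII)
)

def is_identifier(text):
    """Return True if the whole string matches the sketch pattern."""
    return SKETCH_IDENTIFIER.fullmatch(text) is not None

for candidate in ["Ω", "«", "foo²", "väli", "π", "2x", "a+b"]:
    print(candidate, is_identifier(candidate))
```

With this pattern the symbols « » ° are accepted because nothing outside ASCII is filtered at all, which is the main difference from a \w-based rule.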
Comments (5)
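The re module supports the \w syntax which, when the UNICODE flag is set, matches the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database. Unicode identifiers can therefore be matched with \w under that flag.
So the expression you are looking for is: (?u)[^\W0-9]\w*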
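As a quick check of the expression above, here is a small sketch for Python 3. The names IDENTIFIER and is_identifier are chosen here for illustration. On Python 3, str patterns are Unicode-aware by default, so the (?u) prefix is redundant there but harmless; on Python 2 it is what enables the Unicode behaviour for unicode strings.

```python
import re

# (?u) is the inline form of the re.UNICODE flag.
# [^\W0-9] means "a word character that is not an ASCII digit".
IDENTIFIER = re.compile(r"(?u)[^\W0-9]\w*")

def is_identifier(text):
    """Return True if the whole string is matched by the expression."""
    return IDENTIFIER.fullmatch(text) is not None

for candidate in ["väli", "Ω", "π", "foo_1", "2x", "«"]:
    print(candidate, is_identifier(candidate))
```

Note that the expression still rejects « because that character is punctuation rather than a word character, and that [^\W0-9] only excludes the ASCII digits; writing [^\W\d] instead would also exclude decimal digits from other scripts.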
You need to pass the reflags parameter (for example re.UNICODE) to lex.lex.
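A minimal sketch of how that could look in a lexer module follows. The token set, the t_IDENTIFIER pattern, LEXER_FLAGS and build_lexer are assumptions made for illustration, not code from this thread. The flags are combined with re.VERBOSE on the assumption that recent ply releases use re.VERBOSE as the default value of reflags and that a passed value replaces it; this may differ between ply versions.

```python
import re

# Token rules in the form ply.lex expects (module-level names).
tokens = ("IDENTIFIER",)

# Hypothetical identifier rule: a word character that is not an ASCII
# digit, followed by any number of word characters.
t_IDENTIFIER = r"[^\W0-9]\w*"

# Characters skipped between tokens.
t_ignore = " \t"

def t_error(t):
    # Skip any character that no rule matches.
    t.lexer.skip(1)

# Assumption: keep re.VERBOSE alongside re.UNICODE so that ply's usual
# pattern handling is preserved when reflags is overridden.
LEXER_FLAGS = re.UNICODE | re.VERBOSE

def build_lexer():
    """Build the lexer. ply is imported here so that the third-party
    package is only required when a lexer is actually built."""
    import ply.lex as lex
    return lex.lex(reflags=LEXER_FLAGS)
```

With ply installed, calling build_lexer(), feeding it text through lexer.input("väli π") and iterating over the lexer should yield two IDENTIFIER tokens. On Python 3 the re.UNICODE flag is already implied for str patterns, so the flag mainly matters on Python 2.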
Check the answers to this question:
Stripping non printable characters from a string in python
You'd just need to use the other unicode character categories instead.
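The category idea can be sketched with the standard unicodedata module: walk over all code points, keep those whose general category is wanted, and turn them into a regex character class. This is an illustrative sketch for Python 3, not code taken from the linked question. The helper char_class and the category sets START_CATEGORIES and CONTINUE_CATEGORIES are assumptions; the sets were picked so that letters plus « (Pi), » (Pf), ° (So) and ² (No) are accepted.

```python
import re
import sys
import unicodedata

def char_class(categories):
    """Build a regex character class matching every code point whose
    Unicode general category is in `categories`."""
    ranges = []  # consecutive code points collapsed into [start, end] pairs
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)) in categories:
            if ranges and ranges[-1][1] == cp - 1:
                ranges[-1][1] = cp
            else:
                ranges.append([cp, cp])
    parts = []
    for start, end in ranges:
        # \UXXXXXXXX escapes avoid any quoting issues inside the class.
        if start == end:
            parts.append("\\U%08x" % start)
        else:
            parts.append("\\U%08x-\\U%08x" % (start, end))
    return "[" + "".join(parts) + "]"

# Hypothetical choice: letters plus quote punctuation, other symbols
# and non-decimal numbers such as superscripts.
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Pi", "Pf", "So", "No"}
# Later characters may also be decimal digits or connectors like "_".
CONTINUE_CATEGORIES = START_CATEGORIES | {"Nd", "Pc"}

CATEGORY_IDENTIFIER = re.compile(
    char_class(START_CATEGORIES) + char_class(CONTINUE_CATEGORIES) + "*"
)

def is_identifier(text):
    """Return True if the whole string matches the category-based pattern."""
    return CATEGORY_IDENTIFIER.fullmatch(text) is not None
```

Building the classes scans the whole code point range once per call, so in practice the resulting pattern string would be computed once and reused, for example as the value of a ply token rule.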
Solved it with the help of Vinko.
I realised that getting unicode range is plain dumb. So I'll do this:
I don't know about unicode character classes. If this unicode stuff starts getting too complicated, I can just put the original one back in place. UTF-8 support still ensures that unicode works in the STRING tokens, which is more important.
Edit: On the other hand, I'm starting to understand why there's not much unicode support in programming languages. This is an ugly hack, not a satisfying solution.
Probably POSIX character classes are right for you?