Regular expression to confirm whether a string is a valid Python identifier?
I have the following definition for an Identifier:
Identifier --> letter{ letter| digit}
Basically I have an identifier function that gets a string from a file and tests it to make sure that it's a valid identifier as defined above.
I've tried this:
    if re.match('\w+(\w\d)?', i):
        return True
    else:
        return False
but when I run my program every time it meets an integer it thinks that it's a valid identifier.
For example, given
    c = 0 ;
it prints `c` as a valid identifier, which is fine, but it also prints `0` as a valid identifier.
What am I doing wrong here?
Comments (6)
This question was asked 10 years ago, when Python 2 was still dominant. As many comments over the last decade have shown, my answer needed a serious update, starting with a big heads-up:
No simple regex will properly match all (and only) valid Python identifiers. It didn't for Python 2, it doesn't for Python 3.
The reasons are:
1. As JoeCondron pointed out, Python reserved keywords such as `True`, `if`, and `return` are not valid identifiers, so a simple regex alone cannot handle this and additional filtering is required.
2. Python 3 allows non-ASCII letters and numbers in an identifier, but the Unicode categories of letters and numbers that the lexical parser accepts in a valid identifier do not match the categories covered by `\d`, `\w`, and `\W` in the `re` module, as demonstrated by martineau's counter-example and explained in great detail by Hatshepsut's amazing research.

We could try to solve the first issue either by using `keyword.iskeyword()`, as Alexander Huszagh suggested, or by listing all keywords in a (huge) regex negative lookahead clause, as pointed out by Feuermurmel. And we can work around the other issue by restricting ourselves to ASCII-only identifiers.

But with all this trouble and these limitations, why bother using a regex at all?
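To make the two keyword workarounds concrete, here is a minimal sketch, assuming ASCII-only identifiers. The names `ASCII_IDENT`, `is_name_filtered`, `NAME_RE`, and `is_name_lookahead` are illustrative and do not come from any of the answers.

```python
import keyword
import re

# ASCII-only identifier shape: a letter or underscore, then letters, digits, underscores.
ASCII_IDENT = r"[A-Za-z_][A-Za-z0-9_]*"


def is_name_filtered(s):
    """Option 1: match the shape first, then reject reserved keywords."""
    return re.fullmatch(ASCII_IDENT, s) is not None and not keyword.iskeyword(s)


# Option 2: list every keyword in one negative lookahead.
_KEYWORDS = "|".join(re.escape(k) for k in keyword.kwlist)
NAME_RE = re.compile(r"(?!(?:%s)\Z)%s\Z" % (_KEYWORDS, ASCII_IDENT))


def is_name_lookahead(s):
    return NAME_RE.match(s) is not None


for s in ["c", "0", "if", "iffy", "True", "_x1"]:
    print(s, is_name_filtered(s), is_name_lookahead(s))
```

Both options should agree on every input. The lookahead variant has to be rebuilt whenever the keyword list changes between Python versions, which is part of why it feels like more trouble than it is worth.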
As Hatshepsut said: `str.isidentifier()` works.
Just use it, problem solved.
PS: If you upvote me just for this, please also upvote the answer that originally posted this solution!
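For reference, a minimal sketch of that no-regex check, combined with the keyword filter mentioned earlier. The helper name `is_valid_name` is made up for illustration.

```python
import keyword


def is_valid_name(s):
    """True if s is a syntactically valid identifier and not a reserved keyword."""
    return s.isidentifier() and not keyword.iskeyword(s)


for s in ["c", "0", "return", "℘᧚", "foo-bar"]:
    print(repr(s), is_valid_name(s))
```

Note that `str.isidentifier()` on its own returns `True` for keywords such as `return`, so the extra `keyword.iskeyword()` test is still needed.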
As requested by the question, my original 2012 answer presents a regular expression based on the Python 2 official definition of an identifier:
Which can be expressed by the regular expression:
Example:
Result:
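A minimal sketch of a pattern of this kind, assuming the Python 2 lexical rule `identifier ::= (letter | "_") (letter | digit | "_")*` with ASCII letters and digits. The exact pattern, the `re.ASCII` flag, and the names `IDENT_RE`, `samples`, and `results` are assumptions made for illustration.

```python
import re

# Assumed shape: one word character that is not a digit, then any word characters.
# re.ASCII restricts \w and \d to ASCII, as in the Python 2 grammar.
IDENT_RE = re.compile(r"^[^\d\W]\w*\Z", re.ASCII)

samples = ["c", "_private", "x1", "0", "1abc", "foo-bar", "return"]
results = {s: IDENT_RE.match(s) is not None for s in samples}

for s, ok in results.items():
    print("%-10s %s" % (s, ok))
```

In this sketch the last sample, `return`, matches even though it is a reserved word, which is the kind of false positive a pure regex cannot avoid.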
And remember to use `keyword.iskeyword()` to avoid false positives such as the last one.
`str.isidentifier()` works. The regex answers incorrectly fail to match some valid Python identifiers and incorrectly match some invalid ones.
@martineau's comment gives the example of `'℘᧚'`, where the regex solutions fail.
Why does this happen?
Let's define the set of code points that match the given regular expression, and the set that match `str.isidentifier`.
How many regex matches are not identifiers?
How many identifiers are not regex matches?
Interesting -- which ones?
What's different about these two sets?
They have different Unicode "General Category" values.
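A minimal sketch of how the two sets and their categories could be computed, assuming the regex under test is the Unicode-aware `[^\d\W]\w*` discussed in the other answers. The exact counts and category lists depend on the Unicode version bundled with the running Python, so none are shown here.

```python
import re
import sys
import unicodedata

# Assumed pattern: a non-digit word character followed by word characters.
PATTERN = re.compile(r"[^\d\W]\w*")

all_chars = [chr(cp) for cp in range(sys.maxunicode + 1)]
regex_chars = {ch for ch in all_chars if PATTERN.fullmatch(ch)}
ident_chars = {ch for ch in all_chars if ch.isidentifier()}

regex_only = regex_chars - ident_chars  # the regex accepts, Python does not
ident_only = ident_chars - regex_chars  # Python accepts, the regex does not

print(len(regex_only), len(ident_only))
print(sorted({unicodedata.category(ch) for ch in regex_only}))
print(sorted({unicodedata.category(ch) for ch in ident_only}))
```

The two characters from martineau's example land on opposite sides: `℘` is an identifier start that `\w` rejects, while `᧚` is matched by `[^\d\W]` but cannot start an identifier.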
From Wikipedia, that's Letter, modifier; Letter, other; Number, other. This is consistent with the `re` docs, since `\d` is only decimal digits.
What about the other way?
That's Mark, nonspacing; Symbol, math; Symbol, other.
Where is this all documented?
Where is it implemented?
https://github.com/python/cpython/commit/47383403a0a11259acb640406a8efc38981d2255
I still want a regular expression
Look at the regex module on PyPI.
It includes filters for "General Category".
For Python 3, you need to handle Unicode letters and digits. So if that's a concern, you should get along with this:
`[^\d\W]` matches a character that is not a digit and not "not alphanumeric", which translates to "a character that is a letter or underscore".
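A minimal sketch of how a pattern built on `[^\d\W]` behaves with non-ASCII names in Python 3, where `\w` and `\d` are Unicode-aware by default. The full pattern `[^\d\W]\w*` and the helper name `looks_like_identifier` are assumptions for illustration, and, as the other answers show, the result is only an approximation of what Python itself accepts.

```python
import re

# A letter or underscore first, then any letters, digits, or underscores.
UNICODE_IDENT = re.compile(r"[^\d\W]\w*")


def looks_like_identifier(s):
    return UNICODE_IDENT.fullmatch(s) is not None


for s in ["c", "_x", "переменная", "0", "1x", "a b"]:
    print(repr(s), looks_like_identifier(s))
```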
`\w` matches digits as well as letters, so `\w+` on its own accepts `0`. Try `^[_a-zA-Z]\w*$`.
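A quick side-by-side check of the pattern from the question and the suggested one; the names `BROKEN`, `FIXED`, and `table` are just labels for this sketch.

```python
import re

BROKEN = re.compile(r"\w+(\w\d)?")      # pattern from the question
FIXED = re.compile(r"^[_a-zA-Z]\w*$")   # first character must be a letter or underscore

table = {}
for token in ["c", "0", "x1", "1x"]:
    table[token] = (bool(BROKEN.match(token)), bool(FIXED.match(token)))
    print(token, table[token])
```

The question's pattern accepts `0` and `1x` because `\w+` is satisfied by any run of word characters and `re.match` does not require the match to reach the end of the string. One caveat with the suggested pattern: `$` also matches just before a trailing newline, so `\Z` or `re.fullmatch` is stricter.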
The question is about regex, so my answer may look off-topic. The point is that regex is simply not the right approach.
Interested in getting the problematic characters?
Using `str.isidentifier`, one can perform the check character by character, prefixing each one with, say, an underscore to avoid false positives such as digits. How could a name be valid if one of its (prefixed) components is not?
Which solution deals with unauthorised first characters, such as digits or, e.g., `᧚`?
To be improved, since it neither checks for reserved names nor says anything about why `᧚` is only sometimes problematic, whereas `₂` always is... but here is my without-regex approach.
My answer in your terms:
which could be contracted into
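A minimal sketch of the character-by-character idea described in this answer, assuming that prefixing with an underscore is an acceptable way to test whether a character may appear after the first position. The function name `problem_characters` and the reason strings are invented for illustration.

```python
def problem_characters(name):
    """Return (index, char, reason) for each character that makes name invalid."""
    problems = []
    for i, ch in enumerate(name):
        if not ("_" + ch).isidentifier():
            # Invalid even after a valid first character.
            problems.append((i, ch, "never allowed in an identifier"))
        elif i == 0 and not ch.isidentifier():
            # Fine later in the name, but not in first position.
            problems.append((i, ch, "not allowed as the first character"))
    return problems


for name in ["c", "0", "foo-bar", "x₂", "᧚x", "x᧚"]:
    print(repr(name), problem_characters(name))
```

This separates the two cases mentioned above: `᧚` is reported only when it comes first, while `₂` is reported wherever it appears. Like the approach it illustrates, the sketch does not check for reserved names, and an empty string yields no problems even though it is not an identifier.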
I needed a working regex (i.e. I couldn't just use `str.isidentifier`) because I needed to find all identifiers embedded in a string, not just test whether a whole string was a valid identifier. I also couldn't use the `ast` module because I expected the string not to be valid Python syntax. So the existing answers didn't help, and I wasn't satisfied with "use the regex package". So here's an actual regex that does the job, along with the code for constructing it and testing it.
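A minimal sketch of how such a regex could be constructed, assuming it is enough to mirror `str.isidentifier` one character at a time. This is not necessarily how this answer built its pattern, and the names `char_class`, `START`, `CONTINUE`, `IDENTIFIER_RE`, and `find_identifiers` are hypothetical.

```python
import re
import sys


def char_class(predicate):
    """Return a regex character class covering every code point the predicate accepts."""
    parts = []
    run_start = None
    for cp in range(sys.maxunicode + 2):  # the extra step closes a run ending at maxunicode
        accepted = cp <= sys.maxunicode and predicate(chr(cp))
        if accepted and run_start is None:
            run_start = cp
        elif not accepted and run_start is not None:
            last = cp - 1
            parts.append("\\U%08x" % run_start)
            if last > run_start:
                parts.append("-\\U%08x" % last)
            run_start = None
    return "[" + "".join(parts) + "]"


# Characters that may start an identifier, and characters that may continue one.
START = char_class(str.isidentifier)
CONTINUE = char_class(lambda ch: ("_" + ch).isidentifier())

# The lookbehind stops a match from starting inside a longer word such as "9lives".
IDENTIFIER_RE = re.compile("(?<!%s)%s%s*" % (CONTINUE, START, CONTINUE))


def find_identifiers(text):
    return IDENTIFIER_RE.findall(text)


print(find_identifiers("c = 0 ; total_1 += ℘᧚ * 9lives"))
```

The character classes are derived from the running interpreter, so the pattern tracks whatever Unicode version that Python ships with. Keywords are still matched, so a `keyword.iskeyword()` filter on the results may be needed.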