Regular expression to confirm whether a string is a valid Python identifier?

Posted on 2024-10-27 21:35:02 · 436 characters · 6 views · 0 comments

I have the following definition for an Identifier:

Identifier --> letter{ letter| digit}

Basically I have an identifier function that gets a string from a file and tests it to make sure that it's a valid identifier as defined above.

I've tried this:

if re.match('\w+(\w\d)?', i):     
  return True
else:
  return False

but when I run my program every time it meets an integer it thinks that it's a valid identifier.

For example

c = 0 ;

It prints c as a valid identifier, which is fine, but it also prints 0 as a valid identifier.

What am I doing wrong here?

萌吟 2024-11-03 21:35:02

The question was asked 10 years ago, when Python 2 was still dominant. As many comments over the last decade have shown, my answer needed a serious update, starting with a big heads-up:

No simple regex will properly match all (and only) valid Python identifiers. It didn't for Python 2, it doesn't for Python 3.

The reasons are:

As JoeCondron pointed out, Python reserved keywords such as True, if, return are not valid identifiers, so a simple regex alone cannot handle this and additional filtering is required.
Python 3 allows non-ASCII letters and numbers in an identifier, but the Unicode categories of letters and numbers that the lexical parser accepts in a valid identifier do not match the categories covered by \d, \w, \W in the re module, as demonstrated by martineau's counter-example and explained in great detail by Hatshepsut's amazing research.

We could try to solve the first issue either by using keyword.iskeyword(), as Alexander Huszagh suggested, or by listing all keywords in a (huge) regex negative lookahead clause, as pointed out by Feuermurmel. And we can work around the other issue by restricting ourselves to ASCII-only identifiers.
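As a rough sketch of that combination, assuming ASCII-only identifiers are acceptable, the keyword list can be taken from keyword.kwlist and turned into a negative lookahead. The names ASCII_IDENTIFIER and is_ascii_identifier are made up for this illustration and do not come from any of the answers.

```python
import keyword
import re

# Hypothetical names; the character classes are restricted to ASCII on purpose.
# (?!(?:kw1|kw2|...)\Z) rejects strings that are exactly a reserved keyword.
_keywords = "|".join(map(re.escape, keyword.kwlist))
ASCII_IDENTIFIER = re.compile(
    r"(?!(?:%s)\Z)[A-Za-z_][A-Za-z0-9_]*\Z" % _keywords
)


def is_ascii_identifier(text):
    """Return True for ASCII-only identifiers that are not reserved keywords."""
    return ASCII_IDENTIFIER.match(text) is not None


for sample in ["a", "_a1", "1a", "if", "iffy", "aa\n"]:
    print("%r\t= %s" % (sample, is_ascii_identifier(sample)))
```

The lookahead is anchored with \Z so that names which merely start with a keyword, such as iffy, are still accepted.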

But, with all this trouble and limitations, why bother using a regex at all?

As Hatshepsut said:

str.isidentifier() works

Just use it, problem solved.

PS: If you upvote me just for this, please also upvote the answer that originally posted this solution!

As requested by the question, my original 2012 answer presents a regular expression based on the Python 2 official definition of an identifier:

identifier ::=  (letter|"_") (letter | digit | "_")*

Which can be expressed by the regular expression:

^[^\d\W]\w*\Z

Example:

import re
identifier = re.compile(r"^[^\d\W]\w*\Z", re.UNICODE)

tests = [ "a", "a1", "_a1", "1a", "aa$%@%", "aa bb", "aa_bb", "aa\n", "if" ]
for test in tests:
    result = re.match(identifier, test)
    print("%r\t= %s" % (test, (result is not None)))

Result:

'a'      = True
'a1'     = True
'_a1'    = True
'1a'     = False
'aa$%@%' = False
'aa bb'  = False
'aa_bb'  = True
'aa\n'   = False
'if'     = True

And remember to use keyword.iskeyword() to avoid false positives such as the last one.
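Putting the two standard-library checks together, without any regex, might look like the following sketch. is_valid_identifier is a hypothetical name chosen here; str.isidentifier() and keyword.iskeyword() are the only real APIs involved.

```python
import keyword


def is_valid_identifier(text):
    """True if text is a valid identifier and not a reserved keyword."""
    return text.isidentifier() and not keyword.iskeyword(text)


for sample in ["a", "a1", "_a1", "1a", "aa bb", "aa\n", "if", "℘᧚"]:
    print("%r\t= %s" % (sample, is_valid_identifier(sample)))
```

Unlike the regex above, this rejects if and accepts non-ASCII names such as ℘᧚.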

只是一片海 2024-11-03 21:35:02

str.isidentifier() works. The regex answers incorrectly fail to match some valid python identifiers and incorrectly match some invalid ones.

str.isidentifier() Return true if the string is a valid identifier
according to the language definition, section Identifiers and
keywords.
Use keyword.iskeyword() to test for reserved identifiers such as def
and class.

@martineau's comment gives the example of '℘᧚' where the regex solutions fail.

>>> '℘᧚'.isidentifier()
True
>>> import re
>>> bool(re.search(r'^[^\d\W]\w*\Z', '℘᧚'))
False

Why does this happen?

Let's define the set of code points that match the given regular expression, and the set that match str.isidentifier.

import re
import unicodedata

# 0x110000 is one past the last code point (U+10FFFF), so the range covers all of them.
chars = {chr(i) for i in range(0x110000) if re.fullmatch(r'^[^\d\W]\w*\Z', chr(i))}
identifiers = {chr(i) for i in range(0x110000) if chr(i).isidentifier()}

How many regex matches are not identifiers?

In [26]: len(chars - identifiers)                                                                                                               
Out[26]: 698

How many identifiers are not regex matches?

In [27]: len(identifiers - chars)                                                                                                               
Out[27]: 4

Interesting -- which ones?

In [37]: {(c, unicodedata.name(c), unicodedata.category(c)) for c in identifiers - chars}                                                       
Out[37]: 
set([
    ('\u1885', 'MONGOLIAN LETTER ALI GALI BALUDA', 'Mn'),
    ('\u1886', 'MONGOLIAN LETTER ALI GALI THREE BALUDA', 'Mn'),
    ('℘', 'SCRIPT CAPITAL P', 'Sm'),
    ('℮', 'ESTIMATED SYMBOL', 'So'),
])

What's different about these two sets?

They have different Unicode "General Category" values.

In [31]: {unicodedata.category(c) for c in chars - identifiers}                                                                                 
Out[31]: set(['Lm', 'Lo', 'No'])

From Wikipedia, that's Letter, modifier; Letter, other; Number, other. This is consistent with the re docs, since \d is only decimal digits:

\d Matches any Unicode decimal digit (that is, any character in Unicode character category [Nd])

What about the other way?

In [32]: {unicodedata.category(c) for c in identifiers - chars}                                                                                 
Out[32]: set(['Mn', 'Sm', 'So'])

That's Mark, nonspacing; Symbol, math; Symbol, other.
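The mismatch can be checked on a few individual characters. In this sketch, ℘ and ℮ come from the output above, while ² (U+00B2, category No) is an extra sample picked here to illustrate the opposite direction.

```python
import re
import unicodedata

pattern = re.compile(r"^[^\d\W]\w*\Z")

# For each sample, show its General Category and how the two checks disagree.
for c in ["℘", "℮", "²", "a"]:
    print(
        "U+%04X" % ord(c),
        unicodedata.category(c),
        "regex:", bool(pattern.match(c)),
        "isidentifier:", c.isidentifier(),
    )
```

℘ and ℮ are valid identifiers that the regex rejects, whereas ² is matched by the regex (it counts as \w but not as \d) even though it is not a valid identifier.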

Where is this all documented?

Where is it implemented?

https://github.com/python/cpython/commit/47383403a0a11259acb640406a8efc38981d2255

I still want a regular expression

Look at the regex module on PyPI.

This regex implementation is backwards-compatible with the standard ‘re’ module, but offers additional functionality.

It includes filters for "General Category".

南巷近海 2024-11-03 21:35:02

For Python 3, you need to handle Unicode letters and digits. If that's a concern, this pattern should work:

re_ident = re.compile(r"^[^\d\W]\w*$", re.UNICODE)

[^\d\W] matches a character that is not a digit and not "not alphanumeric" which translates to "a character that is a letter or underscore".
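One detail worth checking with this pattern: in Python's re module, $ also matches just before a trailing newline, whereas \Z matches only at the very end of the string. A small comparison, as a sketch (the variable names are chosen here):

```python
import re

with_dollar = re.compile(r"^[^\d\W]\w*$")
with_z = re.compile(r"^[^\d\W]\w*\Z")

# Compare how each anchor treats a string that ends with a newline.
for sample in ["aa_bb", "aa\n"]:
    print(
        "%r" % sample,
        "$:", bool(with_dollar.match(sample)),
        r"\Z:", bool(with_z.match(sample)),
    )
```

If strings read from a file may still carry their line ending, the \Z form (or stripping the line first) avoids accepting "aa\n" as an identifier.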

陈年往事 2024-11-03 21:35:02

\w matches digits as well as letters. Try ^[_a-zA-Z]\w*$
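To see why the pattern in the question accepts 0: \w already includes digits, so \w+ alone matches "0", and re.match only anchors at the start of the string. Below is a minimal sketch of a check that follows the question's own grammar (a letter followed by letters or digits), assuming ASCII letters and digits are meant; SIMPLE_IDENTIFIER and is_simple_identifier are names invented here.

```python
import re

# The question's pattern: \w+ matches digits too, so "0" is accepted.
print(bool(re.match(r"\w+(\w\d)?", "0")))

# Assumed reading of "letter { letter | digit }": ASCII letters and digits only.
SIMPLE_IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def is_simple_identifier(text):
    # fullmatch requires the whole string to match, not just a prefix.
    return SIMPLE_IDENTIFIER.fullmatch(text) is not None


for token in ["c", "0", "x1", "1x"]:
    print("%r\t= %s" % (token, is_simple_identifier(token)))
```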

卸妝后依然美 2024-11-03 21:35:02

(The question is about regex, so my answer may look off-topic. The point is that regex is simply not the right approach.)

Interested in getting the problematic characters?

Using str.isidentifier, one can perform the check character by character, prefixing each character after the first with, say, an underscore to avoid false positives such as digits. After all, how could a name be valid if one of its (prefixed) components is not? E.g.

def checker(str_: str) -> 'set[str]':
    return {
        c for i, c in enumerate(str_)
        if not (f'_{c}' if i else c).isidentifier()
    }

>>> checker('℘3᧚₂')
{'₂'}

This solution also handles invalid first characters, such as digits or ᧚. See:

>>> checker('᧚℘3₂')
{'₂', '᧚'}
>>> checker('3᧚℘₂')
{'3', '₂'}
>>> checker("a$%@#%\n")
{'@', '#', '\n', '$', '%'}

To be improved, since it neither checks for reserved names nor says anything about why ᧚ is sometimes problematic whereas ₂ always is... but here is my without-regex approach.

My answer in your terms:

if not checker(i):
    return True
else:
    return False

which could be contracted into

return not checker(i)

開玄 2024-11-03 21:35:02

I needed a working regex (i.e. I couldn't just use str.isidentifier) because I needed to find all identifiers embedded in a string, not just test if a whole string was a valid identifier. I also couldn't use the ast module because I expected the string to not be valid Python syntax. So the existing answers didn't help, and I wasn't satisfied with 'use the regex package'. So here's an actual regex that does the job, along with the code for constructing it and testing it.

# coding: utf-8
import itertools
import re

def chars():
    # Yield every code point that chr() accepts, starting from U+0000.
    for i in itertools.count():
        try:
            yield chr(i)
        except ValueError:
            break


def make_full_pattern():
    def make_pattern(is_valid):
        pattern = ""

        # Collapse runs of consecutive valid code points into character-class ranges.
        for is_identifier, group in itertools.groupby(chars(), is_valid):
            if is_identifier:
                group = list(group)
                if len(group) == 1:
                    pattern += group[0]
                else:
                    pattern += group[0] + "-" + group[-1]

        return "[" + pattern + "]"

    # First class: valid start characters; second class: valid continuation characters.
    return make_pattern(str.isidentifier) + make_pattern(lambda c: ("x" + c).isidentifier()) + "*"


# The generated pattern is several thousand characters long and begins
# r"[A-Z_a-zªµºÀ-ÖØ-öø-ˁˆ-ˑˠ-ˤˬˮ...". It is built here rather than pasted as a
# literal; the exact ranges depend on the interpreter's Unicode database.
full_pattern = make_full_pattern()


def test_pattern():
    identifier_regex = re.compile(full_pattern)

    # Every single character, alone and after a valid start character,
    # must be classified the same way as str.isidentifier does.
    for char in chars():
        for string in [char, "x" + char]:
            assert bool(identifier_regex.fullmatch(string)) == string.isidentifier()


test_pattern()
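For the use case described above (finding identifiers embedded in a longer string), the search step itself can be sketched with a much shorter, ASCII-only pattern. This is a simplification for illustration, not the full Unicode pattern; EMBEDDED and find_identifiers are hypothetical names, and the lookbehind keeps the tail of a token such as 2y from being reported as an identifier.

```python
import keyword
import re

# ASCII-only simplification (an assumption made here, not the generated pattern).
# The lookbehind ensures a match does not start in the middle of a longer token.
EMBEDDED = re.compile(r"(?<![A-Za-z0-9_])[A-Za-z_][A-Za-z0-9_]*")


def find_identifiers(text):
    """Return identifier-like tokens found in text, skipping reserved keywords."""
    return [name for name in EMBEDDED.findall(text) if not keyword.iskeyword(name)]


print(find_identifiers("c = 0 ; x1 = 2y + _z if c else None"))
```

The same findall approach should work with a compiled Unicode-aware pattern, provided an equivalent lookbehind built from the continuation class is added.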
