How can I get Python's isidentifier() functionality in Python 2.6?

Posted on 2024-08-26 17:37:34

Python 3 has a string method called str.isidentifier

How can I get similar functionality in Python 2.6, short of rewriting my own regex, etc.?

蘑菇王子 2024-09-02 17:37:34

The tokenize module defines a regexp called Name:

import re, tokenize, keyword
re.match(tokenize.Name + '$', somestr) and not keyword.iskeyword(somestr)
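
As written, the expression evaluates to None or a bool depending on whether the regex matched, which is awkward to reuse. A minimal sketch of wrapping it in a function that always returns a bool might look like this. The name `is_identifier` is hypothetical, and the pattern is an ASCII-only spelling written out by hand (an assumption, so that the sketch does not depend on whatever value tokenize.Name has on a given interpreter). It is anchored with \Z rather than $ so that a trailing newline is rejected.

```python
import keyword
import re

# Assumption: an ASCII-only identifier pattern, spelled out explicitly
# instead of being read from tokenize.Name.
_NAME = re.compile(r'[a-zA-Z_][a-zA-Z0-9_]*\Z')


def is_identifier(somestr):
    """Return True if somestr is an ASCII identifier and not a keyword."""
    if _NAME.match(somestr) is None:
        return False
    return not keyword.iskeyword(somestr)
```

Compiling the pattern once at module level avoids rebuilding it on every call.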
耶耶耶 2024-09-02 17:37:34

Invalid Identifier Validation


All of the answers in this thread seem to repeat the same validation mistake, which allows strings that are not valid identifiers to be accepted as if they were.

The regex patterns suggested in the other answers are built from tokenize.Name, which holds the pattern [a-zA-Z_]\w* (on Python 2.7.15), followed by the '$' regex anchor.

Please refer to the official Python 3 description of identifiers and keywords (which contains a paragraph that is relevant to Python 2 as well).

Within the ASCII range (U+0001..U+007F), the valid characters for identifiers are the same as in Python 2.x: the uppercase and lowercase letters A through Z, the underscore _ and, except for the first character, the digits 0 through 9.

Thus 'foo\n' should not be considered a valid identifier.

While one may argue that this code is functional:

>>> class Foo():
...     pass
>>> f = Foo()
>>> setattr(f, 'foo\n', 'bar')
>>> dir(f)
['__doc__', '__module__', 'foo\n']
>>> print getattr(f, 'foo\n')
bar

Although the newline character is indeed a valid ASCII character, it is not considered to be a letter. Furthermore, there is clearly no practical use for an identifier that ends with a newline character:

>>> f.foo\n
SyntaxError: unexpected character after line continuation character

The str.isidentifier function also confirms this is an invalid identifier:

In a Python 3 interpreter:

>>> print('foo\n'.isidentifier())
False

The $ anchor vs the \Z anchor


Quoting the official python2 Regular Expression syntax:

$

Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline. foo matches both ‘foo’ and ‘foobar’, while the regular expression foo$ matches only ‘foo’. More interestingly, searching for foo.$ in 'foo1\nfoo2\n' matches ‘foo2’ normally, but ‘foo1’ in MULTILINE mode; searching for a single $ in 'foo\n' will find two (empty) matches: one just before the newline, and one at the end of the string.

As a result, a string that ends with a newline matches as if it were a valid identifier:

>>> import tokenize
>>> import re
>>> m = re.match(tokenize.Name + '$', 'foo\n')
>>> m
<_sre.SRE_Match at 0x3eac8e0>
>>> print m.group()
foo

The regex pattern should use the \Z anchor instead of the $ anchor.
Quoting once again:

\Z

Matches only at the end of the string.

With \Z the pattern correctly rejects the string:

>>> re.match(tokenize.Name + r'\Z', 'foo\n') is None
True

Dangerous Implications


See Luke's answer for another example of how this kind of weak regex matching could have more dangerous implications in other circumstances.

Further Reading


Python 3 added support for non-ASCII identifiers; see PEP 3131.
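
The difference between the two anchors can be checked side by side. The following is a small sketch, not part of the original answer: the `NAME` pattern is an ASCII-only spelling written out by hand (an assumption, used instead of tokenize.Name so the result does not vary between interpreters), and `accepted` is a hypothetical helper.

```python
import re

# Assumption: ASCII-only identifier pattern, written out explicitly.
NAME = r'[a-zA-Z_][a-zA-Z0-9_]*'

dollar = re.compile(NAME + r'$')    # also matches just before a final newline
strict = re.compile(NAME + r'\Z')   # matches only at the very end of the string


def accepted(pattern, s):
    """Return True if pattern matches s from the start."""
    return pattern.match(s) is not None


samples = ['foo', 'foo\n', 'foo\n\n', 'foo bar', '9lives']
results = [(s, accepted(dollar, s), accepted(strict, s)) for s in samples]

for s, with_dollar, with_z in results:
    print('%-10r  $: %-5s  \\Z: %s' % (s, with_dollar, with_z))
```

Only 'foo\n' is treated differently: the $ pattern accepts it and the \Z pattern does not. 'foo\n\n' is rejected by both, because $ tolerates only a single newline at the very end.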

我为君王 2024-09-02 17:37:34

re.match(r'[a-z_]\w*$', s, re.I)

should do nicely. As far as I know there isn't any built-in method.


來不及說愛妳 2024-09-02 17:37:34

Good answers so far. I'd write it like this.

import keyword
import re

def isidentifier(candidate):
    "Is the candidate string an identifier in Python 2.x"
    is_not_keyword = candidate not in keyword.kwlist
    pattern = re.compile(r'^[a-z_][a-z0-9_]*$', re.I)
    matches_pattern = bool(pattern.match(candidate))
    return is_not_keyword and matches_pattern
倒带 2024-09-02 17:37:34

In Python < 3.0 this is quite easy, as you can't have unicode characters in identifiers. That should do the work:

import re
import keyword

def isidentifier(s):
    if s in keyword.kwlist:
        return False
    return re.match(r'^[a-z_][a-z0-9_]*$', s, re.I) is not None
那支青花 2024-09-02 17:37:34

I've decided to take another crack at this, since there have been several good suggestions. I'll try to consolidate them. The following can be saved as a Python module and run directly from the command-line. If run, it tests the function, so is provably correct (at least to the extent that the documentation demonstrates the capability).

import keyword
import re
import tokenize

def isidentifier(candidate):
    """
    Is the candidate string an identifier in Python 2.x
    Return true if candidate is an identifier.
    Return false if candidate is a string, but not an identifier.
    Raises TypeError when candidate is not a string.

    >>> isidentifier('foo')
    True

    >>> isidentifier('print')
    False

    >>> isidentifier('Print')
    True

    >>> isidentifier(u'Unicode_type_ok')
    True

    # unicode symbols are not allowed, though.
    >>> isidentifier(u'Unicode_content_\u00a9')
    False

    >>> isidentifier('not')
    False

    >>> isidentifier('re')
    True

    >>> isidentifier(object)
    Traceback (most recent call last):
    ...
    TypeError: expected string or buffer
    """
    # test if candidate is a keyword
    is_not_keyword = candidate not in keyword.kwlist
    # create a pattern based on tokenize.Name
    pattern_text = '^{tokenize.Name}$'.format(**globals())
    # compile the pattern
    pattern = re.compile(pattern_text)
    # test whether the pattern matches
    matches_pattern = bool(pattern.match(candidate))
    # return true only if the candidate is not a keyword and the pattern matches
    return is_not_keyword and matches_pattern

def test():
    import unittest
    import doctest
    suite = unittest.TestSuite()
    suite.addTest(doctest.DocTestSuite())
    runner = unittest.TextTestRunner()
    runner.run(suite)

if __name__ == '__main__':
    test()
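
Doctests only prove what they exercise, so it can help to run a candidate implementation against a table of edge cases as well. The sketch below is one way to do that; `CASES`, `failing_cases`, `dollar_version` and `strict_version` are all hypothetical names, and the two implementations use a hand-written ASCII-only pattern (an assumption) rather than tokenize.Name so that the results are the same on Python 2 and Python 3.

```python
import keyword
import re

# Assumption: ASCII-only inputs and keywords that exist in both
# Python 2 and Python 3, so the expectations hold on either version.
CASES = [
    ('foo', True),
    ('_private1', True),
    ('1abc', False),
    ('not', False),
    ('foo bar', False),
    ('', False),
    ('foo\n', False),
]


def failing_cases(func):
    """Return the (input, expected, actual) triples that func gets wrong."""
    failures = []
    for text, expected in CASES:
        actual = bool(func(text))
        if actual != expected:
            failures.append((text, expected, actual))
    return failures


def dollar_version(candidate):
    # Hypothetical stand-in for a '$'-anchored implementation.
    ok = re.match(r'^[a-zA-Z_][a-zA-Z0-9_]*$', candidate)
    return bool(ok) and candidate not in keyword.kwlist


def strict_version(candidate):
    # The same check, anchored with \Z instead of $.
    ok = re.match(r'[a-zA-Z_][a-zA-Z0-9_]*\Z', candidate)
    return bool(ok) and candidate not in keyword.kwlist
```

Calling `failing_cases(dollar_version)` reports the trailing-newline case, while `failing_cases(strict_version)` returns an empty list.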
睫毛溺水了 2024-09-02 17:37:34

What I am using:

import keyword
import re
import tokenize

def is_valid_keyword_arg(k):
    """
    Return True if the string k can be used as the name of a valid
    Python keyword argument, otherwise return False.
    """
    # Don't allow Python reserved words as arg names
    if k in keyword.kwlist:
        return False
    return re.match('^' + tokenize.Name + '$', k) is not None
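
One place such a check is useful is before expanding a dictionary with `**`. The following sketch illustrates the idea. `split_kwargs` and `describe` are hypothetical helpers, and the `is_valid_keyword_arg` defined here is a stand-in that uses a hand-written ASCII-only pattern anchored with \Z (an assumption, chosen so the example behaves the same on Python 2 and Python 3).

```python
import keyword
import re

# Assumption: ASCII-only identifier pattern, anchored with \Z.
_NAME = re.compile(r'[a-zA-Z_][a-zA-Z0-9_]*\Z')


def is_valid_keyword_arg(k):
    """Stand-in check: identifier-shaped and not a reserved word."""
    return _NAME.match(k) is not None and k not in keyword.kwlist


def split_kwargs(mapping):
    """Split mapping into (dict usable as **kwargs, sorted rejected keys)."""
    usable = {}
    rejected = []
    for key, value in mapping.items():
        if is_valid_keyword_arg(key):
            usable[key] = value
        else:
            rejected.append(key)
    return usable, sorted(rejected)


def describe(**options):
    """Toy callee that just reports which option names it received."""
    return sorted(options)


usable, rejected = split_kwargs({'width': 3, 'class': 'x', 'max-height': 9})
received = describe(**usable)
```

Here only 'width' is passed through; 'class' is rejected as a keyword and 'max-height' because of the hyphen.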
不喜欢何必死缠烂打 2024-09-02 17:37:34

All solutions proposed so far either do not support Unicode or, if run on Python 3, allow a number as the first character.

Edit: the proposed solutions should only be used on Python 2; on Python 3, isidentifier should be used. Here is a solution that should work anywhere:

re.match(r'^\w+$', name, re.UNICODE) and not name[0].isdigit()

Basically, it tests whether something consists of (at least one) word characters (including numbers), and then it checks that the first char is not a number.
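
The advice in the edit (use isidentifier on Python 3, a regex on Python 2) can be folded into one helper. This is a minimal sketch under stated assumptions: the fallback pattern is a hand-written ASCII-only one anchored with \Z, and the keyword check is added because str.isidentifier on Python 3 returns True for reserved words such as 'for'.

```python
import keyword
import re

# Assumption: ASCII-only fallback pattern for interpreters where
# strings have no isidentifier method (Python 2).
_ASCII_NAME = re.compile(r'[a-zA-Z_][a-zA-Z0-9_]*\Z')


def isidentifier(name):
    """Use the built-in method where it exists, else the ASCII regex."""
    native = getattr(name, 'isidentifier', None)
    if native is not None:
        looks_ok = native()
    else:
        looks_ok = _ASCII_NAME.match(name) is not None
    # Reserved words are identifier-shaped but cannot be used as names.
    return looks_ok and not keyword.iskeyword(name)
```

On Python 3 this accepts non-ASCII identifiers through the built-in method; on Python 2 it stays ASCII-only, which matches the rules quoted earlier in the thread.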
