How to handle tokenize errors for unterminated multi-line comments (Python 2.6)
The following sample code:

import token, tokenize, StringIO

def generate_tokens(src):
    rawstr = StringIO.StringIO(unicode(src))
    tokens = tokenize.generate_tokens(rawstr.readline)
    for i, item in enumerate(tokens):
        toktype, toktext, (srow, scol), (erow, ecol), line = item
        print i, token.tok_name[toktype], toktext

s = \
"""
def test(x):
    \"\"\" test with an unterminated docstring
"""

generate_tokens(s)
produces the following traceback:
... (stripped a little)
File "/usr/lib/python2.6/tokenize.py", line 296, in generate_tokens
raise TokenError, ("EOF in multi-line string", strstart)
tokenize.TokenError: ('EOF in multi-line string', (3, 5))
Some questions about this behaviour:
- Should I catch and 'selectively' ignore tokenize.TokenError here? Or should I stop trying to generate tokens from non-compliant/non-complete code? If so, how would I check for that?
- Can this error (or similar errors) be caused by anything other than an unterminated docstring?
How you handle tokenize errors depends entirely on why you are tokenizing. Your code gives you all the valid tokens up until the beginning of the bad string literal. If that token stream is useful to you, then use it.
You have a few options about what to do with the error (a minimal sketch follows this list):

- You could ignore it and have an incomplete token stream.
- You could buffer all the tokens and only use the token stream if no error occurred.
- You could process the tokens, but abort the higher-level processing if an error occurred.
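All three boil down to wrapping the generator loop in a try/except. A minimal sketch, assuming Python 2.6's stdlib tokenize as in the question (the helper name tolerant_tokens is mine, not from the answer):

import token, tokenize, StringIO

def tolerant_tokens(src):
    # Collect every token tokenize manages to produce; if it fails
    # part-way (e.g. EOF in a multi-line string), keep the prefix.
    rawstr = StringIO.StringIO(unicode(src))
    toks = []
    try:
        for item in tokenize.generate_tokens(rawstr.readline):
            toks.append(item)
    except tokenize.TokenError, err:
        # ignore -> fall through and return the partial stream
        # buffer -> return [] here instead, for all-or-nothing callers
        # abort  -> re-raise and let higher-level processing stop
        print 'tokenizing stopped early:', err
    return toks

Run against the question's s, this prints the tokens up to the start of the bad string literal and then the TokenError message, instead of dying mid-loop:

for i, (toktype, toktext, _, _, _) in enumerate(tolerant_tokens(s)):
    print i, token.tok_name[toktype], toktext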
As to whether that error can happen with anything other than an incomplete docstring, yes. Remember that docstrings are just string literals. Any unterminated multi-line string literal will give you the same error. Similar errors could happen for other lexical errors in the code.
For example, here are other values of s that produce errors (at least with Python 2.5):
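As representative cases (assumed examples, not quoted from the answer), any input that leaves the tokenizer at EOF in the middle of a multi-line construct raises TokenError:

# Assumed examples: each leaves tokenize at EOF mid-construct.

s = "(1 +"        # unclosed bracket at EOF:
                  # TokenError: ('EOF in multi-line statement', ...)

s = "x = 1 + \\"  # trailing backslash continuation at EOF:
                  # TokenError: ('EOF in multi-line statement', ...)

s = "'''oops"     # another unterminated triple-quoted string:
                  # TokenError: ('EOF in multi-line string', ...)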
Oddly, other nonsensical inputs produce ERRORTOKEN values instead:
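Again, a couple of assumed examples: characters outside Python's lexicon, or a quote that never closes before the end of its line, come back as ERRORTOKEN entries in the stream rather than raising:

# Assumed examples: lexical garbage that tokenize reports in-band.

s = "$ ?"    # '$' and '?' are not Python tokens; each is yielded as
             # an ERRORTOKEN and tokenization carries on past them.

s = "'abc"   # unterminated *single-line* string: the lone quote is
             # yielded as an ERRORTOKEN, then 'abc' as a NAME.

So if you want to detect malformed input reliably, you have to watch for token.ERRORTOKEN in the stream as well as catching tokenize.TokenError.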