Python 3.0 - tokenize and untokenize
I am using something similar to the following simplified script to parse snippets of python from a larger file:
import io
import tokenize
src = 'foo="bar"'
src = bytes(src.encode())
src = io.BytesIO(src)
src = list(tokenize.tokenize(src.readline))
for tok in src:
    print(tok)
src = tokenize.untokenize(src)
Although the code is not the same in Python 2.x, it uses the same idiom and works just fine. However, running the above snippet under Python 3.0, I get this output:
(57, 'utf-8', (0, 0), (0, 0), '')
(1, 'foo', (1, 0), (1, 3), 'foo="bar"')
(53, '=', (1, 3), (1, 4), 'foo="bar"')
(3, '"bar"', (1, 4), (1, 9), 'foo="bar"')
(0, '', (2, 0), (2, 0), '')
Traceback (most recent call last):
File "q.py", line 13, in <module>
src = tokenize.untokenize(src)
File "/usr/local/lib/python3.0/tokenize.py", line 236, in untokenize
out = ut.untokenize(iterable)
File "/usr/local/lib/python3.0/tokenize.py", line 165, in untokenize
self.add_whitespace(start)
File "/usr/local/lib/python3.0/tokenize.py", line 151, in add_whitespace
assert row <= self.prev_row
AssertionError
I have searched for references to this error and its causes, but have been unable to find any. What am I doing wrong and how can I correct it?
[edit]
After partisann's observation that appending a newline to the source makes the error go away, I started experimenting with the list I was untokenizing. It seems that the EOF token causes the error if it is not immediately preceded by a newline, so removing it gets rid of the error. The following script runs without error:
import io
import tokenize
src = 'foo="bar"'
src = bytes(src.encode())
src = io.BytesIO(src)
src = list(tokenize.tokenize(src.readline))
for tok in src:
    print(tok)
src = tokenize.untokenize(src[:-1])
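For reference, partisann's alternative fix (appending a newline to the source rather than dropping the last token) can be sketched like this; with the trailing newline in place, the ENDMARKER is preceded by a NEWLINE token and untokenize no longer trips the assertion:

```python
import io
import tokenize

# Same script as above, but with a trailing newline in the source so
# untokenize() can round-trip the full token list unchanged.
src = 'foo="bar"\n'
toks = list(tokenize.tokenize(io.BytesIO(src.encode()).readline))
out = tokenize.untokenize(toks)
print(out)  # b'foo="bar"\n'
```

Because the token stream starts with an ENCODING token, untokenize returns bytes rather than a string.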
Comments (2)
You forgot the newline.
If you limit the input to untokenize to the first 2 items of the tokens, it seems to work.
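That is, passing only the (type, string) pair of each token puts untokenize into its compatibility mode, which ignores the recorded row/column positions entirely; a minimal sketch:

```python
import io
import tokenize

src = 'foo="bar"'
toks = tokenize.tokenize(io.BytesIO(src.encode()).readline)
# Keeping only the first 2 items (type, string) of each token makes
# untokenize() use its position-free compatibility mode, so the
# missing trailing newline no longer matters.
out = tokenize.untokenize(tok[:2] for tok in toks)
print(out)
```

Note that in this mode the output is only guaranteed to tokenize back to the same tokens; the exact spacing of the original source is not preserved.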