Tokenize module

Please help.

There are many tokens in the tokenize module, like STRING, BACKQUOTE, AMPEREQUAL, etc.

>>> import cStringIO
>>> import tokenize
>>> source = "{'test':'123','hehe':['hooray',0x10]}"
>>> src = cStringIO.StringIO(source).readline
>>> src = tokenize.generate_tokens(src)
>>> src
<generator object at 0x00BFBEE0>
>>> src.next()
(51, '{', (1, 0), (1, 1), "{'test':'123','hehe':['hooray',0x10]}")
>>> token = src.next()
>>> token
(3, "'test'", (1, 1), (1, 7), "{'test':'123','hehe':['hooray',0x10]}")
>>> token[0]
3
>>> tokenize.STRING
3
>>> tokenize.AMPER
19
>>> tokenize.AMPEREQUAL
42
>>> tokenize.AT
50
>>> tokenize.BACKQUOTE
25

This is what I experimented with, but I was not able to find out what these tokens mean.

Where can I learn about this? I need an immediate solution.

梦初启 2024-07-27 04:26:10

The various AMPER, BACKQUOTE, etc. values correspond to the token numbers of the corresponding symbols for Python tokens/operators, i.e. AMPER = "&" (ampersand), AMPEREQUAL = "&=".

However, you don't actually have to care about these. They're used by the internal C tokeniser, but the Python wrapper simplifies the output, translating all operator symbols to the OP token. You can translate the numeric token ids (the first value in each token tuple) to symbolic names using the token module's tok_name dictionary. For example:

>>> import tokenize, token
>>> s = "{'test':'123','hehe':['hooray',0x10]}"
>>> for t in tokenize.generate_tokens(iter([s]).next):
        print token.tok_name[t[0]],

OP STRING OP STRING OP STRING OP OP STRING OP NUMBER OP OP ENDMARKER

As a quick debug statement that describes the tokens a bit better, you could also use tokenize.printtoken. This is undocumented, and it looks like it isn't present in Python 3, so don't rely on it in production code; but for a quick peek at what the tokens mean, you may find it useful:

>>> for t in tokenize.generate_tokens(iter([s]).next):
        tokenize.printtoken(*t)

1,0-1,1:        OP      '{'
1,1-1,7:        STRING  "'test'"
1,7-1,8:        OP      ':'
1,8-1,13:       STRING  "'123'"
1,13-1,14:      OP      ','
1,14-1,20:      STRING  "'hehe'"
1,20-1,21:      OP      ':'
1,21-1,22:      OP      '['
1,22-1,30:      STRING  "'hooray'"
1,30-1,31:      OP      ','
1,31-1,35:      NUMBER  '0x10'
1,35-1,36:      OP      ']'
1,36-1,37:      OP      '}'
2,0-2,0:        ENDMARKER       ''

The various values in the tuple you get back for each token are, in order:

  1. The token id (corresponds to the type, e.g. STRING, OP, NAME, etc.)
  2. The string: the actual token text for this token, e.g. "&" or "'a string'"
  3. The start (line, column) in your input
  4. The end (line, column) in your input
  5. The full text of the line the token is on.
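
(If you're on Python 3, where cStringIO and generator .next() are gone, the same walk still works: generate_tokens yields a TokenInfo named tuple whose fields are exactly the five values listed above. A minimal sketch, not part of the original answer:)

import io
import token
import tokenize

s = "{'test':'123','hehe':['hooray',0x10]}"
for tok in tokenize.generate_tokens(io.StringIO(s).readline):
    # Each tok is a TokenInfo named tuple: (type, string, start, end, line)
    print(token.tok_name[tok.type], repr(tok.string), tok.start, tok.end)
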
往日 2024-07-27 04:26:10

You will need to read Python's tokenizer.c source to understand the details. Just search for the keyword you want to know about; it shouldn't be hard.
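
(A lighter-weight alternative to digging through the C source: the token module already exposes the id-to-name mapping, so you can dump it directly. A minimal sketch that works on both Python 2 and 3:)

import token

# token.tok_name maps numeric token ids to their symbolic names
# (STRING, OP, AMPER, ...), so any id can be looked up without
# reading tokenizer.c.
for tok_id in sorted(token.tok_name):
    print("%d\t%s" % (tok_id, token.tok_name[tok_id]))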

最偏执的依靠 2024-07-27 04:26:10

Python's lexical analysis (including tokens) is documented at http://docs.python.org/reference/lexical_analysis.html. As http://docs.python.org/library/token.html#module-token says, "Refer to the file Grammar/Grammar in the Python distribution for the definitions of the names in the context of the language grammar."
