When I start something new in Python I usually look first at the modules or libraries already out there. There's a 90%+ chance that something suitable already exists.
For tokenizers and parsers this is certainly so. Have you looked at PyParsing?
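As a rough illustration of what a PyParsing-based tokenizer can look like (this snippet is my own sketch, assuming the third-party pyparsing package is installed, not code from the answer):

```python
from pyparsing import Word, alphas, nums, oneOf, OneOrMore

identifier = Word(alphas + "_")
number = Word(nums)
operator = oneOf("= + - * /")
token = identifier | number | operator

# Whitespace is skipped automatically; the result is roughly
# ['x', '=', '42', '+', 'y']
print(OneOrMore(token).parseString("x = 42 + y"))
```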
I've implemented a tokenizer for a C-like programming language. What I did was to split up the creation of tokens into two layers:
Both are generators. The benefits of this approach were:
I feel quite happy with this layered approach.
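A minimal sketch of what such a two-layer, generator-based split could look like (the layer boundaries, names, and token shapes here are illustrative, not the author's actual code):

```python
import re

def raw_lexemes(text):
    # Layer 1: chop the input into raw lexemes (numbers, words, single symbols).
    for match in re.finditer(r"\d+|\w+|\S", text):
        yield match.group()

def tokens(text):
    # Layer 2: classify each raw lexeme into a (type, value) pair.
    for lexeme in raw_lexemes(text):
        if lexeme.isdigit():
            yield ("NUMBER", lexeme)
        elif lexeme.isidentifier():
            yield ("NAME", lexeme)
        else:
            yield ("OP", lexeme)

print(list(tokens("x = 42 + y")))
```

Because both layers are generators, the second layer pulls lexemes from the first one at a time, so nothing has to be buffered up front.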
I'd turn to the excellent Text Processing in Python by David Mertz
"Is there a better alternative to just simply returning a list of tuples"
I had to implement a tokenizer, but it required a more complex approach than a list of tuples, so I implemented one class per token type. You can then return a list of class instances, or, if you want to save resources, return something that implements the iterator interface and generates the next token as the parsing progresses.
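A hedged sketch of that idea (class names, fields, and the toy regex are mine, not the answerer's):

```python
import re

class Token:
    # Hypothetical base class; real tokens would carry whatever the parser needs.
    def __init__(self, value, line, column):
        self.value = value
        self.line = line
        self.column = column

    def __repr__(self):
        return f"{type(self).__name__}({self.value!r}, line={self.line}, col={self.column})"

class Number(Token):
    pass

class Name(Token):
    pass

def tokenize(text):
    # Generator variant: the next token is produced only when the parser asks for it.
    for lineno, line in enumerate(text.splitlines(), 1):
        for match in re.finditer(r"\d+|\w+", line):
            cls = Number if match.group().isdigit() else Name
            yield cls(match.group(), lineno, match.start() + 1)

for tok in tokenize("width = 80\nheight = 24"):
    print(tok)
```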
There's an undocumented class in the re module called re.Scanner. It's very straightforward to use for a tokenizer.
I used re.Scanner to write a pretty nifty configuration/structured data format parser in only a couple hundred lines.
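The usual pattern looks roughly like this (the patterns and the sample input are mine; re.Scanner itself takes a list of (regex, action) pairs, and scan() returns the matched results plus any unmatched remainder):

```python
import re

scanner = re.Scanner([
    (r"[0-9]+",  lambda scanner, token: ("INTEGER", token)),
    (r"[a-z_]+", lambda scanner, token: ("IDENTIFIER", token)),
    (r"[,.]",    lambda scanner, token: ("PUNCTUATION", token)),
    (r"\s+",     None),  # None means: match but emit nothing (skip whitespace)
])

tokens, remainder = scanner.scan("45 pigeons, 23 cows, 11 spiders.")
print(tokens)     # [('INTEGER', '45'), ('IDENTIFIER', 'pigeons'), ...]
print(remainder)  # whatever could not be matched (empty string here)
```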
Python takes a "we're all consenting adults" approach to information hiding. It's OK to use variables as though they were constants, and trust that users of your code won't do something stupid.
In many situations, especially when parsing long input streams, you may find it more useful to implement your tokenizer as a generator function. That way you can easily iterate over all the tokens without needing lots of memory to build the full list of tokens first.
For generators, see the original proposal (PEP 255) or other online docs.
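A small sketch under those assumptions (the token shape and regex are placeholders):

```python
import re

def tokenize(fileobj):
    # Lazy tokenizer: read the stream line by line and yield one token at a
    # time, so the complete token list never has to sit in memory.
    for lineno, line in enumerate(fileobj, 1):
        for match in re.finditer(r"\S+", line):
            yield (lineno, match.start() + 1, match.group())

# Usage: the caller simply iterates.
# with open("input.txt") as source:
#     for token in tokenize(source):
#         handle(token)
```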
Thanks for your help, I've started to bring these ideas together, and I've come up with the following. Is there anything terribly wrong with this implementation (in particular, I'm concerned about passing a file object to the tokenizer):
"Is there a better alternative to just simply returning a list of tuples?"
Nope. It works really well.
"Is there a better alternative to just simply returning a list of tuples?"
That's the approach used by the "tokenize" module for parsing Python source code. Returning a simple list of tuples can work very well.
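For illustration, here's what the tokenize module's tuples look like (this is the standard-library API; only the sample source string is mine):

```python
import io
import tokenize

source = "answer = 40 + 2\n"
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    # Each result is a 5-tuple (a named tuple in Python 3):
    # (type, string, (start_row, start_col), (end_row, end_col), line)
    print(tok)
```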
I have recently built a tokenizer, too, and ran into some of your issues.
Token types are declared as "constants", i.e. variables with ALL_CAPS names, at the module level. For example,
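(Hypothetical names and values, just to illustrate the shape, since the exact constants are up to the implementer:)

```python
# Token-type "constants" at module level; the leading underscore marks them
# as module-private by convention.
_NUMBER     = 0x01
_IDENTIFIER = 0x02
_OPERATOR   = 0x03
_STRING     = 0x04
```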
and so on. I have used an underscore in front of the names to point out that those fields are somehow "private" to the module, but I really don't know whether this is typical or advisable, or even how Pythonic it is. (Also, I'll probably ditch the numbers in favour of strings, because they are much more readable during debugging.)
Tokens are returned as named tuples.
I have used named tuples because the tokenizer's client code (e.g. the parser) reads a little more clearly when using names (e.g. token.value) instead of indexes (e.g. token[0]).
Finally, I've noticed that sometimes, especially when writing tests, I prefer to pass a string to the tokenizer instead of a file object. I call it a "reader", and have a specific method to open it and let the tokenizer access it through the same interface.
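A hedged sketch of the named-tuple token and the "reader" idea (class and field names are mine; the answer doesn't show its actual code):

```python
from collections import namedtuple

# The token itself: field names are illustrative.
Token = namedtuple("Token", ["type", "value", "line", "column"])

class Reader:
    # Hypothetical wrapper that gives the tokenizer one interface whether the
    # source is a plain string (handy in tests) or an open file object.
    def __init__(self, source):
        self._source = source

    def lines(self):
        if isinstance(self._source, str):
            return iter(self._source.splitlines(keepends=True))
        return iter(self._source)

# tokenize(Reader("a = 1")) and tokenize(Reader(open("prog.src"))) could then
# share exactly the same code path.
```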
This being a late answer, there is now something in the official documentation: Writing a tokenizer with the re standard library. This content is in the Python 3 documentation but isn't in the Py 2.7 docs; it is still applicable to older Pythons, though. It includes short code, easy setup, and writing a generator, as several answers here have proposed.
If the docs are not Pythonic, I don't know what is :-)
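The core of that documented pattern, condensed and simplified here (token names and patterns are trimmed down, so details differ from the official recipe):

```python
import re
from collections import namedtuple

Token = namedtuple("Token", ["type", "value"])

# One master regex built from named groups; match.lastgroup tells us which
# token type fired.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("NAME",   r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=()]"),
    ("SKIP",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    for match in MASTER.finditer(text):
        if match.lastgroup != "SKIP":
            yield Token(match.lastgroup, match.group())

print(list(tokenize("x = (1 + 22) * 3")))
```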