将前瞻与生成器一起使用
我在 Python 中实现了一个基于生成器的扫描器,它将字符串标记为 (标记类型,标记值) 形式的元组:
for token in scan("a(b)"):
print token
将打印下
("literal", "a")
("l_paren", "(")
...
一个任务意味着解析标记流,为此,我需要能够在不向前移动指针的情况下从当前一项向前查看一项。事实上,迭代器和生成器不会立即提供完整的项目序列,而是根据需要提供每个项目,这使得向前查找比列表更加棘手,因为除非调用 __next__()
,否则不知道下一个项目。
基于生成器的前瞻的直接实现是什么样的?目前我正在使用一种解决方法,这意味着从生成器中创建一个列表:
token_list = [token for token in scan(string)]
然后可以通过类似的方法轻松实现前瞻:
try:
next_token = token_list[index + 1]
except: IndexError:
next_token = None
当然,这工作得很好。但仔细想想,我的第二个问题出现了:首先让 scan()
成为生成器真的有意义吗?
I have implemented a generator-based scanner in Python that tokenizes a string into tuples of the form (token type, token value):
for token in scan("a(b)"):
print token
would print
("literal", "a")
("l_paren", "(")
...
The next task implies parsing the token stream and for that, I need be able to look one item ahead from the current one without moving the pointer ahead as well. The fact that iterators and generators do not provide the complete sequence of items at once but each item as needed makes lookaheads a bit trickier compared to lists, since the next item is not known unless __next__()
is called.
What could a straightforward implementation of a generator-based lookahead look like? Currently I'm using a workaround which implies making a list out of the generator:
token_list = [token for token in scan(string)]
The lookahead then is easily implemented by something like that:
try:
next_token = token_list[index + 1]
except: IndexError:
next_token = None
Of course this just works fine. But thinking that over, my second question arises: Is there really a point of making scan()
a generator in the first place?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(9)
那里的答案非常好,但我最喜欢的方法是使用 itertools.tee - 给定一个迭代器,它返回两个(或更多,如果需要的话)可以独立推进的。它在内存中缓冲的量与需要的量一样(即,如果迭代器彼此之间没有变得非常“失步”,则不会太多)。例如:
您可以使用此类包装任何迭代器,然后使用包装器的
.lookahead
属性来了解将来要返回的下一个项目是什么。我喜欢将所有真正的逻辑留给 itertools.tee,只提供这种薄胶!-)Pretty good answers there, but my favorite approach would be to use
itertools.tee
-- given an iterator, it returns two (or more if requested) that can be advanced independently. It buffers in memory just as much as needed (i.e., not much, if the iterators don't get very "out of step" from each other). E.g.:You can wrap any iterator with this class, and then use the
.lookahead
attribute of the wrapper to know what the next item to be returned in the future will be. I like to leave all the real logic to itertools.tee and just provide this thin glue!-)您可以编写一个包装器来缓冲来自生成器的一些项目,并提供 Lookahead() 函数来查看这些缓冲的项目:
You can write a wrapper that buffers some number of items from the generator, and provides a lookahead() function to peek at those buffered items:
它并不漂亮,但这可能会满足您的要求
:
It's not pretty, but this may do what you want:
Prints:
这是一个允许将单个项目发送回生成器的示例
Here is an example that allows a single item to be sent back to the generator
使用 itertools.tee< 构造一个简单的前向包装器/a>:
使用类包装现有的可迭代对象或迭代器。然后,您可以使用 next 正常迭代,也可以使用索引查找进行前瞻。
要在 Python 3 下运行此代码,只需将 next 方法更改为 __next__。
Construct a simple lookahead wrapper using itertools.tee:
Use the class to wrap an existing iterable or iterator. You can then either iterate normally using next or you can lookahead with indexed lookups.
To run this code under Python 3, simply change the next method to __next__.
既然你说你正在标记一个字符串而不是一般的可迭代对象,我建议最简单的解决方案是扩展你的标记生成器以返回一个三元组:
(token_type, token_value, token_index)
,其中token_index
是字符串中令牌的索引。然后您可以向前、向后或字符串中的任何其他位置进行查看。只是不要越过终点。我认为最简单、最灵活的解决方案。此外,您不需要使用列表理解来从生成器创建列表。只需调用 list() 构造函数即可:
Since you say you are tokenizing a string and not a general iterable, I suggest the simplest solution of just expanding your tokenizer to return a 3-tuple:
(token_type, token_value, token_index)
, wheretoken_index
is the index of the token in the string. Then you can look forward, backward, or anywhere else in the string. Just don't go past the end. Simplest and most flexible solution I think.Also, you needn't use a list comprehension to create a list from a generator. Just call the list() constructor on it:
保罗的回答很好。具有任意前瞻的基于类的方法可能类似于:
Paul's is a good answer. A class based approach with arbitrary lookahead might look something like:
如果我只需要 1 个元素的前瞻,我将如何简洁地编写它:
示例:
How I would write it concisely, if I just needed 1 element's worth of lookahead:
Example:
您可以使用
lazysequence
,这是一个包装可迭代对象并缓存的不可变序列消耗了内部缓冲区中的项目。您可以像使用任何列表或元组一样使用它,但迭代器仅按给定操作所需的程度进行高级。以下是您的示例使用惰性序列的样子:
以下是您自己实现惰性序列的方式:
这是一个简单的实现。这里缺少一些东西:
if s
) 中,将评估整个序列,而不仅仅是第一项。len(s)
和s[i]
需要迭代序列,即使项目已经存储在内部缓存中。s[-1]
) 和切片 (s[:2]
)。PyPI 包解决了这些问题以及其他一些问题。最后一个警告适用于上述实现和包:
len(s)
产生将迭代器消耗到底的成本。披露:我是
lazysequence
的作者。You can use
lazysequence
, an immutable sequence that wraps an iterable and caches consumed items in an internal buffer. You can use it like any list or tuple, but the iterator is only advanced as much as required for a given operation.Here's how your example would look like with lazy sequences:
And here's how you could implement lazy sequences yourself:
This is a naive implementation. Some things that are missing here:
if s
), the entire sequence is evaluated, instead of just the first item.len(s)
ands[i]
require iterating through the sequence, even when items are already stored in the internal cache.s[-1]
) and slices (s[:2]
) are not supported.The PyPI package addresses these issues, and a few more. A final caveat applies both to the implementation above and the package:
len(s)
to incur the cost of consuming the iterator to its end.Disclosure: I'm the author of
lazysequence
.