将前瞻与生成器一起使用

发布于 2024-08-06 21:15:43 字数 706 浏览 3 评论 0原文

我在 Python 中实现了一个基于生成器的扫描器,它将字符串标记为 (标记类型,标记值) 形式的元组:

for token in scan("a(b)"):
    print token

将打印下

("literal", "a")
("l_paren", "(")
...

一个任务意味着解析标记流,为此,我需要能够在不向前移动指针的情况下从当前一项向前查看一项。事实上,迭代器和生成器不会立即提供完整的项目序列,而是根据需要提供每个项目,这使得向前查找比列表更加棘手,因为除非调用 __next__() ,否则不知道下一个项目。

基于生成器的前瞻的直接实现是什么样的?目前我正在使用一种解决方法,这意味着从生成器中创建一个列表:

token_list = [token for token in scan(string)]

然后可以通过类似的方法轻松实现前瞻:

try:
    next_token = token_list[index + 1]
except: IndexError:
    next_token = None

当然,这工作得很好。但仔细想想,我的第二个问题出现了:首先让 scan() 成为生成器真的有意义吗?

I have implemented a generator-based scanner in Python that tokenizes a string into tuples of the form (token type, token value):

for token in scan("a(b)"):
    print token

would print

("literal", "a")
("l_paren", "(")
...

The next task implies parsing the token stream and for that, I need be able to look one item ahead from the current one without moving the pointer ahead as well. The fact that iterators and generators do not provide the complete sequence of items at once but each item as needed makes lookaheads a bit trickier compared to lists, since the next item is not known unless __next__() is called.

What could a straightforward implementation of a generator-based lookahead look like? Currently I'm using a workaround which implies making a list out of the generator:

token_list = [token for token in scan(string)]

The lookahead then is easily implemented by something like that:

try:
    next_token = token_list[index + 1]
except: IndexError:
    next_token = None

Of course this just works fine. But thinking that over, my second question arises: Is there really a point of making scan() a generator in the first place?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(9

红尘作伴 2024-08-13 21:15:43

那里的答案非常好,但我最喜欢的方法是使用 itertools.tee - 给定一个迭代器,它返回两个(或更多,如果需要的话)可以独立推进的。它在内存中缓冲的量与需要的量一样(即,如果迭代器彼此之间没有变得非常“失步”,则不会太多)。例如:

import itertools
import collections

class IteratorWithLookahead(collections.Iterator):
  def __init__(self, it):
    self.it, self.nextit = itertools.tee(iter(it))
    self._advance()
  def _advance(self):
    self.lookahead = next(self.nextit, None)
  def __next__(self):
    self._advance()
    return next(self.it)

您可以使用此类包装任何迭代器,然后使用包装器的 .lookahead 属性来了解将来要返回的下一个项目是什么。我喜欢将所有真正的逻辑留给 itertools.tee,只提供这种薄胶!-)

Pretty good answers there, but my favorite approach would be to use itertools.tee -- given an iterator, it returns two (or more if requested) that can be advanced independently. It buffers in memory just as much as needed (i.e., not much, if the iterators don't get very "out of step" from each other). E.g.:

import itertools
import collections

class IteratorWithLookahead(collections.Iterator):
  def __init__(self, it):
    self.it, self.nextit = itertools.tee(iter(it))
    self._advance()
  def _advance(self):
    self.lookahead = next(self.nextit, None)
  def __next__(self):
    self._advance()
    return next(self.it)

You can wrap any iterator with this class, and then use the .lookahead attribute of the wrapper to know what the next item to be returned in the future will be. I like to leave all the real logic to itertools.tee and just provide this thin glue!-)

紫南 2024-08-13 21:15:43

您可以编写一个包装器来缓冲来自生成器的一些项目,并提供 Lookahead() 函数来查看这些缓冲的项目:

class Lookahead:
    def __init__(self, iter):
        self.iter = iter
        self.buffer = []

    def __iter__(self):
        return self

    def next(self):
        if self.buffer:
            return self.buffer.pop(0)
        else:
            return self.iter.next()

    def lookahead(self, n):
        """Return an item n entries ahead in the iteration."""
        while n >= len(self.buffer):
            try:
                self.buffer.append(self.iter.next())
            except StopIteration:
                return None
        return self.buffer[n]

You can write a wrapper that buffers some number of items from the generator, and provides a lookahead() function to peek at those buffered items:

class Lookahead:
    def __init__(self, iter):
        self.iter = iter
        self.buffer = []

    def __iter__(self):
        return self

    def next(self):
        if self.buffer:
            return self.buffer.pop(0)
        else:
            return self.iter.next()

    def lookahead(self, n):
        """Return an item n entries ahead in the iteration."""
        while n >= len(self.buffer):
            try:
                self.buffer.append(self.iter.next())
            except StopIteration:
                return None
        return self.buffer[n]
李不 2024-08-13 21:15:43

它并不漂亮,但这可能会满足您的要求

def paired_iter(it):
    token = it.next()
    for lookahead in it:
        yield (token, lookahead)
        token = lookahead
    yield (token, None)

def scan(s):
    for c in s:
        yield c

for this_token, next_token in paired_iter(scan("ABCDEF")):
    print "this:%s next:%s" % (this_token, next_token)

this:A next:B
this:B next:C
this:C next:D
this:D next:E
this:E next:F
this:F next:None

It's not pretty, but this may do what you want:

def paired_iter(it):
    token = it.next()
    for lookahead in it:
        yield (token, lookahead)
        token = lookahead
    yield (token, None)

def scan(s):
    for c in s:
        yield c

for this_token, next_token in paired_iter(scan("ABCDEF")):
    print "this:%s next:%s" % (this_token, next_token)

Prints:

this:A next:B
this:B next:C
this:C next:D
this:D next:E
this:E next:F
this:F next:None
凯凯我们等你回来 2024-08-13 21:15:43

这是一个允许将单个项目发送回生成器的示例

def gen():
    for i in range(100):
        v=yield i           # when you call next(), v will be set to None
        if v:
            yield None      # this yields None to send() call
            v=yield v       # so this yield is for the first next() after send()

g=gen()

x=g.next()
print 0,x

x=g.next()
print 1,x

x=g.next()
print 2,x # oops push it back

x=g.send(x)

x=g.next()
print 3,x # x should be 2 again

x=g.next()
print 4,x

Here is an example that allows a single item to be sent back to the generator

def gen():
    for i in range(100):
        v=yield i           # when you call next(), v will be set to None
        if v:
            yield None      # this yields None to send() call
            v=yield v       # so this yield is for the first next() after send()

g=gen()

x=g.next()
print 0,x

x=g.next()
print 1,x

x=g.next()
print 2,x # oops push it back

x=g.send(x)

x=g.next()
print 3,x # x should be 2 again

x=g.next()
print 4,x
怎樣才叫好 2024-08-13 21:15:43

使用 itertools.tee< 构造一个简单的前向包装器/a>:

from itertools import tee, islice

class LookAhead:
    'Wrap an iterator with lookahead indexing'
    def __init__(self, iterator):
        self.t = tee(iterator, 1)[0]
    def __iter__(self):
        return self
    def next(self):
        return next(self.t)
    def __getitem__(self, i):
        for value in islice(self.t.__copy__(), i, None):
            return value
        raise IndexError(i)

使用类包装现有的可迭代对象或迭代器。然后,您可以使用 next 正常迭代,也可以使用索引查找进行前瞻。

>>> it = LookAhead([10, 20, 30, 40, 50])
>>> next(it)
10
>>> it[0]
20
>>> next(it)
20
>>> it[0]
30
>>> list(it)
[30, 40, 50]

要在 Python 3 下运行此代码,只需将 next 方法更改为 __next__

Construct a simple lookahead wrapper using itertools.tee:

from itertools import tee, islice

class LookAhead:
    'Wrap an iterator with lookahead indexing'
    def __init__(self, iterator):
        self.t = tee(iterator, 1)[0]
    def __iter__(self):
        return self
    def next(self):
        return next(self.t)
    def __getitem__(self, i):
        for value in islice(self.t.__copy__(), i, None):
            return value
        raise IndexError(i)

Use the class to wrap an existing iterable or iterator. You can then either iterate normally using next or you can lookahead with indexed lookups.

>>> it = LookAhead([10, 20, 30, 40, 50])
>>> next(it)
10
>>> it[0]
20
>>> next(it)
20
>>> it[0]
30
>>> list(it)
[30, 40, 50]

To run this code under Python 3, simply change the next method to __next__.

嗫嚅 2024-08-13 21:15:43

既然你说你正在标记一个字符串而不是一般的可迭代对象,我建议最简单的解决方案是扩展你的标记生成器以返回一个三元组:
(token_type, token_value, token_index),其中 token_index 是字符串中令牌的索引。然后您可以向前、向后或字符串中的任何其他位置进行查看。只是不要越过终点。我认为最简单、最灵活的解决方案。

此外,您不需要使用列表理解来从生成器创建列表。只需调用 list() 构造函数即可:

 token_list = list(scan(string))

Since you say you are tokenizing a string and not a general iterable, I suggest the simplest solution of just expanding your tokenizer to return a 3-tuple:
(token_type, token_value, token_index), where token_index is the index of the token in the string. Then you can look forward, backward, or anywhere else in the string. Just don't go past the end. Simplest and most flexible solution I think.

Also, you needn't use a list comprehension to create a list from a generator. Just call the list() constructor on it:

 token_list = list(scan(string))
枫林﹌晚霞¤ 2024-08-13 21:15:43

保罗的回答很好。具有任意前瞻的基于类的方法可能类似于:

class lookahead(object):
    def __init__(self, generator, lookahead_count=1):
        self.gen = iter(generator)
        self.look_count = lookahead_count

    def __iter__(self):
        self.lookahead = []
        self.stopped = False
        try:
            for i in range(self.look_count):
                self.lookahead.append(self.gen.next())
        except StopIteration:
            self.stopped = True
        return self

    def next(self):
        if not self.stopped:
            try:
                self.lookahead.append(self.gen.next())
            except StopIteration:
                self.stopped = True
        if self.lookahead != []:
            return self.lookahead.pop(0)
        else:
            raise StopIteration

x = lookahead("abcdef", 3)
for i in x:
    print i, x.lookahead

Paul's is a good answer. A class based approach with arbitrary lookahead might look something like:

class lookahead(object):
    def __init__(self, generator, lookahead_count=1):
        self.gen = iter(generator)
        self.look_count = lookahead_count

    def __iter__(self):
        self.lookahead = []
        self.stopped = False
        try:
            for i in range(self.look_count):
                self.lookahead.append(self.gen.next())
        except StopIteration:
            self.stopped = True
        return self

    def next(self):
        if not self.stopped:
            try:
                self.lookahead.append(self.gen.next())
            except StopIteration:
                self.stopped = True
        if self.lookahead != []:
            return self.lookahead.pop(0)
        else:
            raise StopIteration

x = lookahead("abcdef", 3)
for i in x:
    print i, x.lookahead
稚然 2024-08-13 21:15:43

如果我只需要 1 个元素的前瞻,我将如何简洁地编写它:

SEQUENCE_END = object()

def lookahead(iterable):
    iter = iter(iterable)
    current = next(iter)
    for ahead in iter:
        yield current,ahead
        current = ahead
    yield current,SEQUENCE_END

示例:

>>> for x,ahead in lookahead(range(3)):
>>>     print(x,ahead)
0, 1
1, 2
2, <object SEQUENCE_END>

How I would write it concisely, if I just needed 1 element's worth of lookahead:

SEQUENCE_END = object()

def lookahead(iterable):
    iter = iter(iterable)
    current = next(iter)
    for ahead in iter:
        yield current,ahead
        current = ahead
    yield current,SEQUENCE_END

Example:

>>> for x,ahead in lookahead(range(3)):
>>>     print(x,ahead)
0, 1
1, 2
2, <object SEQUENCE_END>
海之角 2024-08-13 21:15:43

您可以使用 lazysequence,这是一个包装可迭代对象并缓存的不可变序列消耗了内部缓冲区中的项目。您可以像使用任何列表或元组一样使用它,但迭代器仅按给定操作所需的程度进行高级。

以下是您的示例使用惰性序列的样子:

from lazysequence import lazysequence

token_list = lazysequence(token for token in scan(string))

try:
    next_token = token_list[index + 1]
except IndexError:
    next_token = None

以下是您自己实现惰性序列的方式:

from collections.abc import Sequence


class lazysequence(Sequence):
    def __init__(self, iterable):
        self._iter = iter(iterable)
        self._cache = []

    def __iter__(self):
        yield from self._cache
        for item in self._iter:
            self._cache.append(item)
            yield item

    def __len__(self):
        return sum(1 for _ in self)

    def __getitem__(self, index):
        for position, item in enumerate(self):
            if index == position:
                return item

        raise IndexError("lazysequence index out of range")

这是一个简单的实现。这里缺少一些东西:

  • 惰性序列最终会将所有项目存储在内存中。无法获得不再缓存项目的普通迭代器。
  • 在布尔上下文 (if s) 中,将评估整个序列,而不仅仅是第一项。
  • len(s)s[i] 需要迭代序列,即使项目已经存储在内部缓存中。
  • 不支持负索引 (s[-1]) 和切片 (s[:2])。

PyPI 包解决了这些问题以及其他一些问题。最后一个警告适用于上述实现和包:

  • 显式优于隐式。向客户传递迭代器并处理其局限性可能会更好。例如,客户端可能不希望 len(s) 产生将迭代器消耗到底的成本。

披露:我是lazysequence的作者。

You can use lazysequence, an immutable sequence that wraps an iterable and caches consumed items in an internal buffer. You can use it like any list or tuple, but the iterator is only advanced as much as required for a given operation.

Here's how your example would look like with lazy sequences:

from lazysequence import lazysequence

token_list = lazysequence(token for token in scan(string))

try:
    next_token = token_list[index + 1]
except IndexError:
    next_token = None

And here's how you could implement lazy sequences yourself:

from collections.abc import Sequence


class lazysequence(Sequence):
    def __init__(self, iterable):
        self._iter = iter(iterable)
        self._cache = []

    def __iter__(self):
        yield from self._cache
        for item in self._iter:
            self._cache.append(item)
            yield item

    def __len__(self):
        return sum(1 for _ in self)

    def __getitem__(self, index):
        for position, item in enumerate(self):
            if index == position:
                return item

        raise IndexError("lazysequence index out of range")

This is a naive implementation. Some things that are missing here:

  • The lazy sequence will eventually store all items in memory. There's no way to obtain a normal iterator that no longer caches items.
  • In a boolean context (if s), the entire sequence is evaluated, instead of just the first item.
  • len(s) and s[i] require iterating through the sequence, even when items are already stored in the internal cache.
  • Negative indices (s[-1]) and slices (s[:2]) are not supported.

The PyPI package addresses these issues, and a few more. A final caveat applies both to the implementation above and the package:

  • Explicit is better than implicit. Clients may be better off being passed an iterator and dealing with its limitations. For example, clients may not expect len(s) to incur the cost of consuming the iterator to its end.

Disclosure: I'm the author of lazysequence.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文