Sorted 1-D iterator based on 2-D iterators (Cartesian product of iterators)
I'm looking for a clean way to do this in Python:
Let's say I have two iterators "iter1" and "iter2": perhaps a prime number generator, and itertools.count(). I know a priori that both are infinite and monotonically increasing. Now I want to take some simple operation of two args "op" (perhaps operator.add or operator.mul), and calculate every element of the first iterator with every element of the next, using said operation, then yield them one at a time, sorted. Obviously, this is an infinite sequence itself. (As mentioned in a comment by @RyanThompson: this would be called the Cartesian Product of these sequences... or, more exactly, the 1d-sort of that product.)
What is the best way to:
- wrap-up "iter1", "iter2", and "op" in an iterable that itself yields the values in monotonically increasing output.
Allowable simplifying assumptions:
- If it helps, we can assume op(a,b) >= a and op(a,b) >= b.
- If it helps, we can assume op(a,b) > op(a,c) for all b > c.
Also allowable:
- Also acceptable would be an iterator that yields values in "generally increasing" order... by which I mean the iterable could occasionally give me a number less than a previous one, but it would somehow make "side information" available (as via a method on the object) that would say "I'm not guaranteeing the next value I give you will be greater than the one I just gave you, but I AM SURE that all future values will at least be greater than N.".... and "N" itself is monotonically increasing.
The only way I can think to do this is a sort of "diagonalization" process, where I keep an increasing number of partially processed iterables around, and "look ahead" for the minimum of all the possible next() values, and yield that. But this weird agglomeration of a heapq and a bunch of deques just seems outlandish, even before I start to code it.
Please: do not base your answer on the fact that my examples mentioned primes or count().... I have several uses for this very concept that are NOT related to primes and count().
UPDATE: OMG! What a great discussion! And some great answers with really thorough explanations. Thanks so much. StackOverflow rocks; you guys rock.
I'm going to delve in to each answer more thoroughly soon, and give the sample code a kick in the tires. From what I've read so far, my original suspicions are confirmed that there is no "simple Python idiom" to do this. Rather, by one way or another, I can't escape keeping all yielded values of iter1 and iter2 around indefinitely.
FWIW: here's an official "test case" if you want to try your solutions.
import operator
def powers_of_ten():
    n = 0
    while True:
        yield 10**n
        n += 1

def series_of_nines():
    yield 1
    n = 1
    while True:
        yield int("9"*n)
        n += 1
op = operator.mul
iter1 = powers_of_ten()
iter2 = series_of_nines()
# given (iter1, iter2, op), create an iterator that yields:
# [1, 9, 10, 90, 99, 100, 900, 990, 999, 1000, 9000, 9900, 9990, 9999, 10000, ...]
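For example, a candidate solution (assumed here, just for illustration, to be a generator called sorted_op(iter1, iter2, op); no such function is given in the question) could be spot-checked against that expected prefix:

import itertools

expected = [1, 9, 10, 90, 99, 100, 900, 990, 999, 1000,
            9000, 9900, 9990, 9999, 10000]
result = sorted_op(iter1, iter2, op)     # sorted_op is a hypothetical solution
assert list(itertools.islice(result, len(expected))) == expected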
Comments (6)
Define the sequences as a1, a2, a3, ... (from iter1) and b1, b2, b3, ... (from iter2), and let a1b1 mean op(a1, b1) for short.
Based on your allowable assumptions (very important) you know the following: within a row the values only grow, i.e. axb1 <= axb2 <= axb3 <= ..., and every axby is at least max(ax, by).
You'll have to do something like:
Generate a1b1. You know that if you continue increasing the b variables, you will only get higher values. The question now is: is there a smaller value by increasing the a variables? Your lower bound is min(a1, b1), so you will have to increase the a values until min(ax, b1) >= a1b1. Once you hit that point, you can find the smallest value from anb1 where 1 <= n <= x and yield that safely.
You then will have multiple horizontal chains that you'll have to keep track of. Every time you have a value that goes past min(ax, b1), you'll have to increase x (adding more chains) until min(ax, b1) is larger than it before safely emitting it.
Just a starting point... I don't have time to code it at the moment.
EDIT: Oh heh that's exactly what you already had. Well, without more info, this is all you can do, as I'm pretty sure that mathematically, that's what is necessary.
EDIT2: As for your 'acceptable' solution: you can just yield a1bn in increasing order of n, returning min(a1, b1) as N =P. You'll need to be more specific. You speak as if you have a heuristic of what you generally want to see, the general way you want to progress through both iterables, but without telling us what it is I don't know how one could do better.
UPDATE: Winston's is good but makes an assumption that the poster didn't mention: that op(b, c) > op(a, c) if b > a. However, we do know that op(a, b) >= a and op(a, b) >= b.
Here is my solution, which takes that second assumption but not the one Winston took. Props to him for the code structure, though:
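As a rough sketch in that spirit (an illustration relying only on op(a, b) >= a and op(a, b) >= b, not the answer's original code): pull both inputs forward one element at a time, buffer every newly reachable result in a heap, and flush whatever can no longer be undercut by pairs that have not been generated yet.

import heapq

def sorted_products(iter1, iter2, op):
    a_vals, b_vals = [], []
    pending = []                         # min-heap of results not yet safe to yield

    while True:
        # Pull one more value from each input, extending the known block of pairs.
        a_vals.append(next(iter1))
        b_vals.append(next(iter2))
        new_a, new_b = a_vals[-1], b_vals[-1]

        for b in b_vals:                 # results of the newly reachable pairs
            heapq.heappush(pending, op(new_a, b))
        for a in a_vals[:-1]:            # (new_a, new_b) was already pushed above
            heapq.heappush(pending, op(a, new_b))

        # Any pair not generated yet involves an operand >= new_a or >= new_b,
        # so its result is at least min(new_a, new_b); smaller results are final.
        bound = min(new_a, new_b)
        while pending and pending[0] <= bound:
            yield heapq.heappop(pending)

On the question's test case this yields 1, 9, 10, 90, 99, 100, ... as expected, at the cost of keeping every value of both inputs around.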
For a (pathological) input that demonstrates the difference, consider:
Winston's answer gives:
While mine gives:
Let me start with an example of how I would solve this intuitively.
Because reading code inline is a little tedious, I'll introduce some notation:
Notation
- i1 will represent iter1, and i1_0 will represent the first element of iter1. Same for iter2.
- ※ will represent the op operator
Intuitive solution
By using simplifying assumption 2, we know that i1_0 ※ i2_0 is the smallest element that will ever be yielded from your final iterator. The next element would be the smaller of i1_0 ※ i2_1 and i1_1 ※ i2_0.
Assuming i1_0 ※ i2_1 is smaller, you would yield that element. Next, you would yield the smaller of i1_1 ※ i2_0, i1_0 ※ i2_2, and i1_1 ※ i2_1.
Expression as traversal of a DAG
What you have here is a graph traversal problem. First, think of the problem as a tree. The root of the tree is i1_0 ※ i2_0. This node, and each node below it, has two children. The two children of i1_x ※ i2_y are the following: one child is i1_{x+1} ※ i2_y, and the other child is i1_x ※ i2_{y+1}. Based on your second assumption, we know that i1_x ※ i2_y is less than both of its children.
(In fact, as Ryan mentions in a comment, this is a directed acyclic graph, or DAG. Some "parents" share "children" with other "parent" nodes.)
Now, we need to keep a frontier - a collection of nodes that could be next to be returned. After returning a node, we add both its children to the frontier. To select the next node to visit (and return from your new iterator), we compare the values of all the nodes in the frontier. We take the node with the smallest value and we return it. Then, we again add both of its child nodes to the frontier. If the child is already in the frontier (added as the child of some other parent), just ignore it.
Storing the frontier
Because you are primarily interested in the value of the nodes, it makes sense to store these nodes indexed by value. As such, it may be in your interest to use a dict. Keys in this dict should be the values of nodes. Values in this dict should be lists containing individual nodes. Because the only identifying information in a node is the pair of operands, you can store individual nodes as a two-tuple of operands.
In practice, after a few iterations, your frontier may look like the following:
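For instance, using the question's test case (op = operator.mul, iter1 = powers of ten, iter2 = the nines series), after the values 1, 9 and 10 have been yielded the frontier would look roughly like this (a reconstructed illustration, not the answer's original example):

frontier = {
    90:  [(1, 1)],   # 10 * 9
    99:  [(0, 2)],   # 1 * 99
    100: [(2, 0)],   # 100 * 1
}

Each key is a node value; each (x, y) pair indexes into the memoized prefixes of iter1 and iter2.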
Other implementation notes
Because iterators don't support random access, you'll need to hang on to values that are produced by your first two iterators until they are no longer needed.
You'll know that a value is still needed if it is referenced by any value in your frontier. You'll know that a value is no longer needed once all nodes in the frontier reference values later/greater than one you've stored. For example, i1_20 is no longer needed when nodes in your frontier reference only i1_21, i1_25, i1_33, ...
As mentioned by Ryan, each value from each iterator will be used an infinite number of times. Thus, every value produced will need to be saved.
Not practical
Unfortunately, in order to assure that elements are returned only in increasing order, the frontier will grow without bound. Your memoized values will also grow without bound. This may be something you can address by making your problem less general, but this should be a good starting point.
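A minimal sketch of this traversal (a reconstruction under the stated assumption that op is monotonically increasing in each argument, not code from the answer); a heap of (value, x, y) triples stands in for the value-indexed dict, and a set of seen (x, y) pairs keeps a shared child from being added twice:

import heapq

def sorted_cartesian(iter1, iter2, op):
    a_vals, b_vals = [], []            # memoized prefixes of iter1 / iter2

    def value(x, y):
        # Extend the memoized prefixes on demand, then combine.
        while len(a_vals) <= x:
            a_vals.append(next(iter1))
        while len(b_vals) <= y:
            b_vals.append(next(iter2))
        return op(a_vals[x], b_vals[y])

    frontier = [(value(0, 0), 0, 0)]   # heap of (value, x, y)
    seen = {(0, 0)}
    while frontier:
        v, x, y = heapq.heappop(frontier)
        yield v
        for cx, cy in ((x + 1, y), (x, y + 1)):
            if (cx, cy) not in seen:
                seen.add((cx, cy))
                heapq.heappush(frontier, (value(cx, cy), cx, cy))

As the paragraph above warns, both seen and the memoized prefixes grow without bound.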
So you basically want to take two monotonically increasing sequences, and then (lazily) compute the multiplication (or addition, or another operation) table between them, which is a 2-D array. Then you want to put the elements of that 2-D array in sorted order and iterate through them.
In general, this is impossible. However, if your sequences and operation are such that you can make certain guarantees about the rows and columns of the table, then you can make some progress. For example, let's assume that your sequences are monotonically-increasing sequences of positive integers only, and that the operation is multiplication (as in your example). In this case, we know that every row and column of the array is a monotonically-increasing sequence. In this case, you do not need to compute the entire array, but rather only parts of it. Specifically, you must keep track of the following:
To compute the next element in your iterator, you must do the following:
This process is kind of complex, and in particular notice that to compute N values, you must in the worst case save an amount of state proportional to N. (Edit: sqrt(N) is actually the best case.) This is in stark contrast to a typical generator, which only requires constant space to iterate through its elements regardless of length.
In summary, you can do this under certain assumptions, and you can provide a generator-like interface to it, but it cannot be done in a "streaming" fashion, because you need to save a lot of state in order to iterate through the elements in the correct order.
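As a rough sketch of the kind of partial-table state being described (an illustration under the assumption that every row and every column of the op-table is monotonically increasing, not the answer's own code): keep only the next unconsumed cell of each row opened so far, opening a new row whenever a column-0 cell is consumed.

import heapq

def sorted_table(iter1, iter2, op):
    a_vals, b_vals = [], []

    def a(i):
        while len(a_vals) <= i:
            a_vals.append(next(iter1))
        return a_vals[i]

    def b(j):
        while len(b_vals) <= j:
            b_vals.append(next(iter2))
        return b_vals[j]

    heap = [(op(a(0), b(0)), 0, 0)]    # one live (value, row, column) cell per opened row
    while True:
        v, i, j = heapq.heappop(heap)
        yield v
        heapq.heappush(heap, (op(a(i), b(j + 1)), i, j + 1))      # advance this row
        if j == 0:
            heapq.heappush(heap, (op(a(i + 1), b(0)), i + 1, 0))  # open the next row

The heap holds one cell per opened row, which is where the sqrt(N)-to-N state estimate comes from.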
编写为生成器应该相当容易,它们可以按照您想要的顺序返回您想要的值。您还可以考虑使用 协程,它可以让您将值发送到 一个发电机。Use generators, which are just iterators written as functions that yield results. In this case you can write generators for
iter1
anditer2
and another generator to wrap them and yield their results (or do calculations with them, or the history of their results) as you go.From my reading of the question you want something like this, which will calculate every element of the first iterator with every element of the next, using said operation, you also state you want some way to wrap-up "iter1", "iter2", and "op" in an iterable that itself yields the values in monotonically increasing output. I propose generators offer a simple solution to such problem.
Generators offer a lot of flexibility, so it should be fairly easy to write iter1 and iter2 as generators that return the values you want in the order you want. You could also consider using coroutines, which let you send values into a generator.
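A minimal illustration of the wrapping idea (hypothetical code; note that pairing elements positionally like this does not, by itself, produce the sorted Cartesian product the question asks for):

def combined(iter1, iter2, op):
    # Combine the two streams element-by-element as they are produced.
    for a, b in zip(iter1, iter2):
        yield op(a, b)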
。实际上,您可能会将序列设为记忆函数并选择搜索算法以避免破坏记忆项目池。Discussion in other answers observes that there is potentially infinite storage required no matter what the algorithm, since every
a[n]
must remain available for each newb[n]
. If you remove the restriction that the input be two iterators and instead only require that they be sequences (indexable or merely something that can be regenerated repeatedly) then I believe all of your state suddenly collapses to one number: The last value you returned. Knowing the last result value you can search the output space looking for the next one. (If you want to emit duplicates properly then you may need to also track the number of times the result has been returned)With a pair of sequences you have a simple recurrence relation:
where f(seq1, seq2, p) searches for the minimum value q in the output space such that q > p. In practical terms you'd probably make the sequences memoized functions and choose your search algorithm to avoid thrashing the pool of memoized items.
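A sketch of that idea (names and the cut-off rule are assumptions for illustration, not from the answer): treat the sequences as index-to-value functions and let the recurrence be next = f(seq1, seq2, prev), where f scans the output space and can stop along either axis as soon as the raw sequence values exceed the best candidate found so far, because op(a, b) >= a and op(a, b) >= b.

def next_value(seq1, seq2, op, p):
    # Smallest op(seq1(i), seq2(j)) strictly greater than p.
    # Assumes both sequences are monotonically increasing and unbounded.
    best = None
    i = 0
    while True:
        a = seq1(i)
        if best is not None and a > best:
            break                    # op(a, b) >= a > best for every remaining i
        j = 0
        while True:
            b = seq2(j)
            if best is not None and b > best:
                break                # op(a, b) >= b > best for every remaining j
            v = op(a, b)
            if v > p and (best is None or v < best):
                best = v
            j += 1
        i += 1
    return best

def sorted_unique_values(seq1, seq2, op):
    # Yields each distinct result once, in increasing order; as noted above,
    # duplicates would need an extra count to be emitted properly.
    p = next_value(seq1, seq2, op, float("-inf"))
    while True:
        yield p
        p = next_value(seq1, seq2, op, p)

For the question's test case, seq1 could be lambda i: 10 ** i and seq2 could be lambda j: int("9" * j) if j else 1.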