连接一组有序整数生成 Python 迭代器
这是一个看似简单的问题:给定一个按升序生成整数序列的迭代器列表,编写一个简洁的生成器,仅生成每个序列中出现的整数。
昨晚读了几篇论文后,我决定用 Python 编写一个完全最小的全文索引器,如此处所示(尽管该版本现在已经很旧了)。
我的问题是 search()
函数,它必须迭代每个发布列表并仅生成每个列表中出现的文档 ID。 正如您从上面的链接中看到的,我当前的非递归“工作”尝试非常糟糕。
示例:
postings = [[1, 100, 142, 322, 12312],
[2, 100, 101, 322, 1221],
[100, 142, 322, 956, 1222]]
应该产生:
[100, 322]
至少有一个优雅的递归函数解决方案,但如果可能的话我想避免这种情况。 然而,涉及嵌套生成器表达式、itertools 滥用或任何其他类型的代码高尔夫的解决方案是非常受欢迎的。 :-)
应该可以安排函数只需要与最小列表中的项目一样多的步骤,而不会将整个整数集吸入内存。 将来,这些列表可能会从磁盘读取,并且会大于可用 RAM。
在过去的 30 分钟里,我的脑海里浮现出一个想法,但我无法将其转化为代码。 请记住,这只是为了好玩!
Here is a seemingly simple problem: given a list of iterators that yield sequences of integers in ascending order, write a concise generator that yields only the integers that appear in every sequence.
After reading a few papers last night, I decided to hack up a completely minimal full text indexer in Python, as seen here (though that version is quite old now).
My problem is with the search()
function, which must iterate over each posting list and yield only the document IDs that appear on every list. As you can see from the link above, my current non-recursive 'working' attempt is terrible.
Example:
postings = [[1, 100, 142, 322, 12312],
[2, 100, 101, 322, 1221],
[100, 142, 322, 956, 1222]]
Should yield:
[100, 322]
There is at least one elegant recursive function solution to this, but I'd like to avoid that if possible. However, a solution involving nested generator expressions, itertools
abuse, or any other kind of code golf is more than welcome. :-)
It should be possible to arrange for the function to only require as many steps as there are items in the smallest list, and without sucking the entire set of integers into memory. In future, these lists may be read from disk, and larger than available RAM.
For the past 30 minutes I've had an idea on the tip of my tongue, but I can't quite get it into code. Remember, this is just for fun!
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(7)
...你可以尝试利用列表是有序的这一事实,但由于reduce、生成器表达式和set都是用C实现的,因此你可能很难用python实现的逻辑做得比上面更好。
... you could try and take advantage of the fact that the lists are ordered, but since reduce, generator expressions and set are all implemented in C, you'll probably have a hard time doing better than the above with logic implemented in python.
该解决方案将计算迭代器的交集。 它的工作原理是一次推进迭代器一步,并在所有迭代器中寻找相同的值。 当找到时,就会产生这样的值——这使得 intersect 函数本身成为生成器。
This solution will compute the intersection of your iterators. It works by advancing the iterators one step at a time and looking for the same value in all of them. When found, such values are yielded -- this makes the
intersect
function a generator itself.如果这些序列确实很长(甚至无限),并且您不想提前将所有内容加载到集合中,则可以在每个迭代器上使用 1 项前瞻来实现此目的。
用法:
If these are really long (or even infinite) sequences, and you don't want to load everything into a set in advance, you can implement this with a 1-item lookahead on each iterator.
Usage:
这个怎么样:
我还没有非常彻底地测试它(只是运行了你的例子),但我相信基本的想法是合理的。
What about this:
I haven't tested it very thoroughly (just ran your example), but I believe the basic idea is sound.
我想证明有一个优雅的解决方案,它仅向前迭代一次。 抱歉,我不太了解 Python,所以我使用虚构的类。 它读取
input
(一个迭代器数组),并即时写入output
,而无需返回或使用任何数组函数!I want to show that there's an elegant solution, which only iterates forward once. Sorry, I don't know the Python well enough, so I use fictional classes. This one reads
input
, an array of iterators, and writes tooutput
on-the-fly without ever going back or using any array function!.这个运行时间为
O(n*m)
,其中n
是所有迭代器长度的总和,m
是列表的数量。 通过使用第 6 行中的堆,可以将其变为O(n*logm)
。This one runs in
O(n*m)
wheren
is the sum of all iterator lengths, andm
is the number of lists. It can be madeO(n*logm)
by using a heap in line 6.