使用python的lambda、map的高效方法
我需要在 Bigtable(db) 中存储一个大的整数列表。为了提高效率,我将它们存储为两个连续项目之间的差异。
例如:
original_list = [1005, 1004, 1003, 1004, 1006]
将上面的列表(实际上包含超过 1000k 个项目)存储为
start = 1005 diff = [-1, -1, 1, 2]
我可以管理的最接近的是,
ltp = [start] map(lambda x: ltp.append(ltp[-1] + x), tick)
我正在寻找一种有效的方法将其转换回原始列表。
I need to store a big list of integers in Bigtable(db). For efficiency I am storing them as diff between 2 consecutive items.
for eg:
original_list = [1005, 1004, 1003, 1004, 1006]
Storing the above list(which actually contains more than 1000k items) as
start = 1005 diff = [-1, -1, 1, 2]
The closest I could manage is,
ltp = [start] map(lambda x: ltp.append(ltp[-1] + x), tick)
I am looking for an efficient way to convert it back into original list.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(9)
对于如此大的数据结构,numpy 可以很好地工作。对于这个例子,它的速度超过 200 倍(见下文),并且更容易编码,基本上只是
numpy 和直接列表操作之间的比较:
给出
实际上,不过,重用已建立的压缩似乎更好算法,就像可以使用 PyTables 轻松完成,而不是像您正在做的那样滚动自己这里。
另外,在这里,我建议您在读取数据时为前置的开始术语留出空间,而不是使用前置的术语重建列表,当然,这样您就不必进行复制。
For such large data structures numpy will work well. For this example, it's over 200x faster (see below), and a bit easier to code, basically just
Comparison between numpy and direct list manipulation:
gives
Really, though, it seems better to reuse an established compression algorithm, like can easily be done with PyTables, rather than rolling your own like it seems that you're doing here.
Also, here, I'm suggesting that you read in the data with room for the prepended start term, rather than rebuild the list with the prepended term, of course, so you don't have to do the copy.
以下对我有用:
使用
map
将创建一个相同大小的新数组,并填充None
。我还发现一个简单的for
循环更具可读性,在这种情况下,速度尽可能快。The following works for me:
Using
map
will create an new array of the same size, filled withNone
. I also find a simplefor
loop more readable, and in this case as fast as you can get.非常适合发电机:
Perfect for generators:
其他几个受访者对您要求的算法有合理的实现,但我不清楚您真正想要解决的问题到底是什么。
除非存储的数字非常大(即溢出整数并需要 bignum),否则差异列表不会为您带来任何效率 - 整数是来自 Python 运行时 POV 的整数,因此您的示例“diff”列表
[-1, -1, 1, 2]
将消耗与原始列表[1005, 1004, 1003, 1004, 1006]
一样多的内存。Several of the other respondents have reasonable implementations of the algorithm you asked for, but I'm unclear on exactly what problem it is you're really trying to solve.
Unless the numbers being stored are very large (i.e., overflow an integer and require bignums), your list of diffs won't gain you any efficiency -- an integer is an integer from the Python runtime POV, so your example "diff" list of
[-1, -1, 1, 2]
will consume just as much memory as the original list[1005, 1004, 1003, 1004, 1006]
.现在尝试:
Now try:
正如 mshsayem 建议的那样,使用列表推导式 - 它们通常比 for 循环或 map/lambda 更快(根据 Mark Lutz 的《学习 Python》一书)。
如果你真的想使用更 FP 式的解决方案,正确的函数是“scan”,[我相信]它没有在 Python 中实现,所以你必须自己实现它(这不是一项艰巨的任务)。
“扫描”基本上是一个减少,但它不是将列表减少为单个值,而是将每次“迭代”的结果存储在新列表中。
如果你实现了它,你可以这样做:
As mshsayem suggested, use list comprehensions - they are generally faster than for loops or map/lambdas (according do Mark Lutz's book Learning Python).
If you really want to use an more FP-ish solution, the proper function would be "scan", wich [I believe] isn't implemented in Python so you would have to implement it yourself (which is not a hard task).
"scan" is basically a reduce, but instead of reducing the list to a single value, it stores the result of each "iteration" in a new list.
If you implemented it, you could do something like:
虽然我不明白为什么这应该更有效,但我很确定 for 循环将提供最佳性能:
Although I don't get why this should be more efficient, I am pretty sure a for loop will give the best performance:
我不知道您将整数存储为差异的原因 - rcoder 给出了一个很好的答案,说明为什么这通常并不比存储整数本身更有效 - 但如果您不需要访问整个列表立刻,使用生成器在内存方面会更有效。既然你说这是一个“大列表”,那么你可以通过这种方式节省大量内存,而不是一次分配整个列表。下面是一个生成器理解来获取您的列表:
然后您可以像任何列表一样迭代 int_generator ,而无需立即将整个列表存储在内存中。但请注意,您不能为生成器添加下标或切片,但可以在许多有用的情况下使用它。
您可以清理该示例,以便起始变量不需要是全局的。它只是不能位于 mod_start 函数的本地。
编辑:您不必使用生成器理解来获取生成器。您还可以将生成器函数与yield 表达式一起使用,就像 THC4k 所做的那样。这避免了起始变量作用域问题,并且可能更干净一些。您还可以随时从生成器获取列表,方法是将其传递给 list() 内置函数。
I don't know about your reasoning for storing the integers as diffs -- rcoder gave a good answer about why this generally is not more efficient than storing the integers themselves -- but if you don't need to have access to the entire list at once, it's more efficient memory-wise for you to use a generator. Since you say this is a "big list," you can save a lot of memory this way, instead of allocating the entire list at once. Here's a generator comprehension to get your list back:
You can then iterate over int_generator like you would any list, without having the entire list in memory at once. Note, however, that you cannot subscript or slice a generator, but you can use it in many useful situations.
You can clean up the example so that the start variable does not need to be global. It just can't be local to the mod_start function.
Edit: You don't have to use the generator comprehension to get a generator. You can also use a generator function with the yield expression, like THC4k did. That avoids the start variable scope issue and is probably a little cleaner. You can also get a list from a generator at any time by passing it to the list() built-in function.
不评论这个的性能,但是你可以在这里使用reduce。
给你你想要的。
No comment on the performance of this, but you can use reduce here.
gets you what you want.