Python: Are Zope's BTree OOSet, IISet, etc. effective for this requirement?

I asked another question:
https://stackoverflow.com/questions/1180240/best-way-to-sort-1m-records-in-python
where I was trying to determine the best approach for sorting 1 million records. In my case I need to be able to add additional items to the collection and have them re-sorted. It was suggested that I try using Zope's BTrees for this task. After doing some reading, I am a little stumped as to what data I would put in a set.

Basically, for each record I have two pieces of data: (1) a unique ID which maps to a user, and (2) a value of interest to sort on.

I see that I can add the items to an OOSet as tuples, with the value to sort on at index 0. So, given (200, 'id1'), (120, 'id2'), (400, 'id3'), the resulting set would be sorted with id2, id1 and id3 in that order.

However, part of the requirement is that each id appears only once in the set. I will be adding additional data to the set periodically, and the new data may or may not include duplicated 'ids'. If an id is duplicated, I want to update the value rather than add an additional entry. So, based on the tuples above, I might add (405, 'id1'), (10, 'id4') to the set and would want the output to have id4, id2, id3, id1 in order.
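
To make the requirement concrete, here is a minimal plain-dict sketch of the behaviour I'm after (illustrative only; the ids and values are made up, and this is not the Zope-based solution I'm asking about):

values = {'id1': 200, 'id2': 120, 'id3': 400}

# New data arrives; a duplicated id overwrites its old value
# instead of adding a second entry.
values.update({'id1': 405, 'id4': 10})

ranking = sorted(values.items(), key=lambda kv: kv[1])
print([k for k, v in ranking])
# ['id4', 'id2', 'id3', 'id1']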

Any suggestions on how to accomplish this? Sorry for my newbness on the subject.

* EDIT - additional info *

Here is some actual code from the project:

for field in lb_fields:
    t = time.time()
    # Build a (value, id) tuple per record for this field, then sort
    # descending so the highest values come first.
    self.data[field] = [(v[field], k) for k, v in self.foreign_keys.iteritems()]
    self.data[field].sort(reverse=True)
    print "Added %s: %03.5f seconds" % (field, (time.time() - t))

foreign_keys is the original data in a dictionary with each id as the key and a dictionary of the additional data as the value. data is a dictionary containing the lists of sorted data.
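
To illustrate those shapes, here is a runnable sketch with made-up field names (not the real project data):

foreign_keys = {
    'id1': {'score': 200, 'rank': 3},   # id -> dict of additional data
    'id2': {'score': 120, 'rank': 7},
}
data = {}
for field in ['score', 'rank']:
    # One reverse-sorted list of (value, id) tuples per field.
    data[field] = sorted(((v[field], k) for k, v in foreign_keys.items()),
                         reverse=True)
print(data['score'])
# [(200, 'id1'), (120, 'id2')]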

As a side note, as each iteration of the for field in lb_fields loop runs, the time to sort increases - not by much... but it is noticeable. After 1 million records have been sorted for each of the 16 fields, it is using about 4 Gigs of RAM. Eventually this will run on a machine with 48 Gigs.


Comments (2)

小傻瓜 2024-08-05 02:04:06

I don't think BTrees or other traditional sorted data structures (red-black trees, etc) will help you, because they keep order by key, not by corresponding value -- in other words, the field they guarantee as unique is the same one they order by. Your requirements are different, because you want uniqueness along one field, but sortedness by the other.
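
To see why, here's a quick sketch assuming the ZODB BTrees package is available (from BTrees.OOBTree import OOSet): an OOSet keeps (value, id) tuples sorted by value, but nothing stops the same id from appearing under two different values:

from BTrees.OOBTree import OOSet

s = OOSet([(200, 'id1'), (120, 'id2'), (400, 'id3')])
s.insert((405, 'id1'))   # 'id1' is now in the set twice
print(list(s))
# [(120, 'id2'), (200, 'id1'), (400, 'id3'), (405, 'id1')]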

What are your performance requirements? With a rather simple pure Python implementation, based on Python dicts for uniqueness and Python sorts, on a not-blazingly-fast laptop, I get 5 seconds for the original construction (essentially a sort over the million elements, starting with them as a dict), and about 9 seconds for the "update" with 20,000 new id/value pairs of which half "overlap" (thus overwrite) an existing id and half are new. I can implement the update in a faster way, about 6.5 seconds, but that implementation has an anomaly: if one of the "new" pairs is exactly identical to one of the "old" ones, both id and value, it gets duplicated. Warding against such "duplication of identicals" is what pushes me from 6.5 seconds to 9, and I imagine you would need the same kind of precaution.

How far are these 5- and 9-second times from your requirements (taking into account the actual speed of the machine you'll be running on vs. the 2.4 GHz Core Duo with 2GB of RAM, and the typical performance issues of the laptop I'm using)? IOW, is it close enough to "striking distance" to be worth tinkering with and trying to squeeze the last few cycles out of, or do you need performance that is orders of magnitude faster?

I've tried several other approaches (with a SQL DB, with C++ and its std::sort &c, ...) but they're all slower, so if you need much higher performance I'm not sure what you could do.

Edit: since the OP says this performance would be fine but he can't achieve anywhere near it, I guess I'd best show the script I used to measure these times:

import gc
import operator
import random
import time


nk = 1000

def popcon(d):
  # Populate d with nk*1000 ids mapped to random values.
  for x in xrange(nk*1000):
    d['id%s' % x] = random.randrange(100*1000)

def sorted_container():
  ctr = dict()
  popcon(ctr)
  start = time.time()
  # Sort the (id, value) items by value; only this part is timed.
  ctr_sorted = ctr.items()
  ctr_sorted.sort(key=operator.itemgetter(1))
  stend = time.time()
  return stend-start, ctr_sorted

def do_update(ctr, newones):
  start = time.time()
  # Existing (id, value) pairs as a dict, for duplicate detection.
  dicol = dict(ctr)
  # Append only pairs that differ from what is already stored, so an
  # identical (id, value) pair is never added twice.
  ctr.extend((k,v) for (k,v) in newones if v!=dicol.get(k,None))
  # The new pairs win for any overlapping id.
  dicnu = dict(newones)
  ctr.sort(key=operator.itemgetter(1))
  # Keep a pair only if its value matches the new value for that id
  # (ids that weren't updated pass through unchanged), dropping the
  # superseded old entries.
  newctr = [(k,v) for (k,v) in ctr if v==dicnu.get(k,v)]
  stend = time.time()
  return stend-start, newctr

def main():
  random.seed(12345)
  for x in range(3):
    duration, ctr = sorted_container()
    print 'dict-to-sorted, %d: %.2f sec, len=%d' % (x, duration, len(ctr))
    newones = [('id%s' % y, random.randrange(nk*100))
                for y in xrange(nk*990,nk*1010)]
    duration, ctr = do_update(ctr, newones)
    print 'updt-to-sorted, %d: %.2f sec, len=%d' % (x, duration, len(ctr))
    del ctr
    gc.collect()

main()

and this is a typical run:

$ time python som.py
dict-to-sorted, 0: 5.01 sec, len=1000000
updt-to-sorted, 0: 9.78 sec, len=1010000
dict-to-sorted, 1: 5.02 sec, len=1000000
updt-to-sorted, 1: 9.12 sec, len=1010000
dict-to-sorted, 2: 5.03 sec, len=1000000
updt-to-sorted, 2: 9.12 sec, len=1010000

real    0m54.073s
user    0m52.464s
sys 0m1.258s

The overall elapsed time is a few seconds more than the totals I'm measuring, obviously, because it includes the time needed to populate the container with random numbers, generate the "new data" (also randomly), and destroy and garbage-collect things at the end of each run, and so forth.

This is with the system-supplied Python 2.5.2 on a Macbook with Mac OS X 10.5.7, 2.4 GHz Intel Core Duo, and 2GB of RAM (times don't change much when I use different versions of Python).

琉璃梦幻 2024-08-05 02:04:06

It is perfectly possible to solve your problem. For this you should just note that the container types in Python always compare objects by calling their comparison methods (__eq__, __lt__, and so on). Therefore you should do something like:

class Record:
    'Combination of unique part and sort part.'
    def __init__(self, unique, sort):
        self.unique = unique
        self.sort = sort

    def __hash__(self):
        # Hash should be implemented if __eq__ is implemented.
        return hash(self.unique)

    def __eq__(self, other):
        return self.unique == other.unique

    def __lt__(self, other):
        return self.sort < other.sort

# 'btree' stands in for whatever sorted-container constructor you use;
# it is not a real function name.
records = btree(Record(u, s) for u, s in zip(unique_data, sort_data))

print(records.pop())

Notes:

  • depending on how your favorite container type is implemented, you might need to add methods for !=, <=, >, >= as well
  • this will not break the relationship between == and <= as long as x.unique == y.unique ==> x.sort == y.sort
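
As a quick sanity check of those comparison semantics (plain Python, no BTree involved; the data is made up, and Record is the class above):

a = Record('id1', 200)
b = Record('id1', 405)   # same unique part, different sort part
c = Record('id2', 120)

print(a == b)        # True  -- equality follows .unique
print(a < b)         # True  -- ordering follows .sort
print(len({a, b}))   # 1     -- hashing collapses duplicates by id
print([r.unique for r in sorted([a, c])])
# ['id2', 'id1']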