What is the fastest way to compare one list's values against a huge 2D list in Python?

Posted 2024-10-05 16:08:46


I want to check whether the values of one list exist among the values of another list. The lists are huge (50k+ items, from a database).

EDIT:

I also want to mark each duplicated record as duplicated=True and keep it in the table for later reference.

Here is how the lists are built:

n_emails = [[db_id, checksum] for db_id, checksum in search_results]
# I want to check whether a checksum also occurs elsewhere in the same list (or in another list) and retrieve the id (db_id) if it does
# example: n_emails = [[1,'CAFEBABE010'], [2,'bfeafe3df1ds'], [3,'deadbeef101'], [5,'CAFEBABE010']]
# in this case I want to retrieve ids 1 and 5 because they have the same checksum

for m in n_emails:
    dups=_getdups(n_emails,m[1],m[0])           
    n_dups=[casesdb.duplicates.insert( **dup ) for dup in dups]
    if n_dups:
        print "Dupe Found"
        casesdb(casesdb.email_data.id == m[0]).update(duplicated=True)

def _getdups(old_lst, em_md5, em_id):
    # entries in old_lst are [db_id, checksum]
    dups = []
    for old in old_lst:
        if em_md5 == old[1] and old[0] != em_id:
            dups.append(dict(org_id=old[0], md5hash=old[1], dupID=em_id))
    return dups

But it seems to take far too long, and with larger lists (50k vs 50k records+) it ran for over 5000 seconds and never finished; is it a never-ending loop?
The server I am running it on has 4 GB of RAM and 4 cores. Obviously I am doing something wrong.

Please help .. thanks a lot!

SOLVED:

Dict index mapping is way faster! (When the MySQL table is not indexed; please note I have not tested against an indexed table.)

It's 20 secs vs 30 milliseconds = 20*1000 / 30 ≈ 666 times! LOL


Comments (4)

无声情话 2024-10-12 16:08:46


the fastest way is to use a dict like this:

n_emails= [[1,'CAFEBABE010'],[2,'bfeafe3df1ds'],[3,'deadbeef101'],[5,'CAFEBABE010']]

d = {}
for id, hash in n_emails:
    if hash not in d:
        d[hash] = [id]
    else:
        d[hash].append(id)

for hash, ids in d.items():
    if len(ids) > 1:
        print hash, ids

this is nearly the algorithm for a hash join
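
For the original two-list case (comparing one list's checksums against another's), the same idea is a classic hash join: build a dict from one list, then probe it with the other. A minimal sketch, assuming the lists hold [db_id, checksum] pairs (the sample data and the names old_emails/new_emails are illustrative):

old_emails = [[1, 'CAFEBABE010'], [2, 'bfeafe3df1ds'], [3, 'deadbeef101']]
new_emails = [[5, 'CAFEBABE010'], [6, 'feedface999']]

# build phase: index the first list by checksum
index = {}
for db_id, checksum in old_emails:
    index.setdefault(checksum, []).append(db_id)

# probe phase: look up each checksum from the other list in that index
for db_id, checksum in new_emails:
    if checksum in index:
        print("%s matches existing ids %s" % (db_id, index[checksum]))
        # prints: 5 matches existing ids [1]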


# pseudocode: SQL queries written inline inside Python-style loops
for hash, num in select hash, count(id) as num from emails group by hash having num > 1:
    first = None
    for index, id in enumerate(select id from emails where hash=hash order by id desc):
        if index == 0:
            first = id
            continue
        update emails set duplicate=first where id=id

This would be the SQL/Python solution; here I take the duplicate column and use it to store which message this one is thought to be a duplicate of.

the emails table would be at least:

create table emails (id, hash, duplicate default null)
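
A runnable sketch of that SQL/Python approach, using Python's sqlite3 module against the emails table above (the file name emails.db and the choice of SQLite rather than MySQL are assumptions for illustration):

import sqlite3

conn = sqlite3.connect("emails.db")  # assumed database file containing the emails table above
cur = conn.cursor()

# find every hash that occurs more than once
dup_hashes = cur.execute(
    "select hash from emails group by hash having count(id) > 1"
).fetchall()

for (h,) in dup_hashes:
    ids = [row[0] for row in cur.execute(
        "select id from emails where hash = ? order by id desc", (h,))]
    first = ids[0]
    # mark every other row as a duplicate of that first row
    for dup_id in ids[1:]:
        cur.execute("update emails set duplicate = ? where id = ?", (first, dup_id))

conn.commit()
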
很糊涂小朋友 2024-10-12 16:08:46


What you're doing wrong is:

  • You can probably get the result directly from the database. It's much faster than Python.
  • You are doing a linear search for the checksums, which means that each of the 50k entries gets compared to each of the other 50k entries... that is a huge number of comparisons (50,000 × 50,000 = 2.5 billion).

What you should do is build an index over the checksums. Make a dict that maps checksum -> entry. When you insert the entries, check whether the checksum already exists; if so, the entry is a duplicate.
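
A minimal sketch of that insert-time check, assuming entries arrive as (db_id, checksum) pairs; the names index and insert_entry are illustrative, not an existing API:

index = {}  # checksum -> the first entry seen with that checksum

def insert_entry(db_id, checksum):
    # returns True when this entry duplicates one that was already inserted
    if checksum in index:
        return True
    index[checksum] = (db_id, checksum)
    return False

print(insert_entry(1, 'CAFEBABE010'))  # False, first time this checksum is seen
print(insert_entry(5, 'CAFEBABE010'))  # True, duplicate checksum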

Or simply use your database; databases love indexing.

2024-10-12 16:08:46


You'd be better off looking for duplicates with SQL. For example, see this page describing how to find duplicates.

Pulling all of those results into Python and processing them is never going to be very fast, but if you must, your best bet is to have a dictionary of checksums to IDs:

got_checksums = {}
for id, checksum in emails:
    if checksum in got_checksums:
        print id, got_checksums[checksum]
    else:
        got_checksums[checksum] = id
这个俗人 2024-10-12 16:08:46


Finally, thanks to all the answers, I found that dict mapping is insanely fast! Way faster than SQL queries.

Here is my SQL query test (it may look awkward, but it is the syntax of web2py DAL queries).

I tested both approaches on 3,500 records, and the dict mapping alone against over 250,000 records.

print "de_duping started at %s" % str( datetime.datetime.now() )

dupe_n = 0
l_dupe_n = 0
for em_hash in n_emails:
    dup_ids=casesdb(casesdb.email_data.MD5Hash==em_hash[1]).select(casesdb.email_data.id)
    if len(dup_ids)>1:
        dupe_n+=1

print "Email Dupes %s" % (dupe_n)
print "Local de_duping ended at %s" % str( datetime.datetime.now() )

Results in:

de_duping started at 2010-12-02 03:39:24.610888
Email Dupes 3067
Local de_duping ended at 2010-12-02 03:39:52.669849

About 28 secs.
Here's the dict-based index mapping, based on Dan D's answer:

print "de_duping started at %s" % str( datetime.datetime.now() )

dedupe_emails = {}
dupe_n = 0
for id, hash in n_emails:
    if hash not in dedupe_emails:
        dedupe_emails[hash] = [id]
    else:
        # a checksum we have already seen: mark this record as a duplicate
        dedupe_emails[hash].append( id )
        dupe_n += 1
        casesdb( casesdb.email_data.id == id ).update( duplicated = True )

print "Email Dupes %s" % (dupe_n)
print "Local de_duping ended at %s" % str( datetime.datetime.now() )

Results in:

de_duping started at 2010-12-02 03:41:21.505235
Email Dupes 2591 # this count is accurate; the database query above also counts the first match as a duplicate
Local de_duping ended at 2010-12-02 03:41:21.531899

only what? 30 ms!

And let's see how it did de-duping 250k records!

de_duping at 2010-12-02 03:44:20.120880
Email Dupes 93567 
Local de_duping ended at 2010-12-02 03:45:12.612449

Less than a minute!!

Thanks for all the answers; I would like to accept everyone who pointed me in the right direction, but Dan D gave me the most detailed answer! Thanks Dan!
