Fastest way to compare the values of one list against a huge 2-D list in Python?
I want to compare whether a value from one list exists among the values of another list. The lists are huge (50k+ items, from a database).
EDIT:
I also want to mark the duplicated records with duplicated=True and keep them in the table for later reference.
Here is how the lists are built:
n_emails = [[db_id, checksum] for db_id, checksum in search_results]
# I want to check whether a checksum exists elsewhere in the same list (or in another list)
# and retrieve the matching id (db_id) if it does.
# example: n_emails = [[1,'CAFEBABE010'], [2,'bfeafe3df1ds'], [3,'deadbeef101'], [5,'CAFEBABE010']]
# in this case I want to retrieve ids 1 and 5 because they share the same checksum
def _getdups(old_lst, em_md5, em_id):
    # collect every other entry whose checksum matches em_md5
    dups = []
    for old in old_lst:
        if em_md5 == old[1] and old[0] != em_id:
            dups.append(dict(org_id=old[0], md5hash=old[1], dupID=em_id))
    return dups

for m in n_emails:
    dups = _getdups(n_emails, m[1], m[0])
    n_dups = [casesdb.duplicates.insert(**dup) for dup in dups]
    if n_dups:
        print("Dupe Found")
        casesdb(casesdb.email_data.id == m[0]).update(duplicated=True)
But it seems far too slow: on the larger lists (50k vs. 50k+ records) it ran for over 5000 seconds and never finished, as if it were stuck in a never-ending loop.
The server I am running it on has 4 GB of RAM and 4 cores. Obviously I am doing something wrong.
Please help... thanks a lot!
SOLVED:
Dict index mapping is much faster! (That is with the MySQL table not indexed; please note I have not tested against an indexed table.)
It's 20 secs vs. 30 milliseconds = 20*1000 / 30 ≈ 666 times! LOL
Answers (4)
The fastest way is to use a dict, like this:
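A minimal sketch of that (the variable names follow the [db_id, checksum] layout from the question):

dct = {}
for db_id, checksum in n_emails:
    if checksum not in dct:
        dct[checksum] = [db_id]
    else:
        dct[checksum].append(db_id)

# every checksum mapped to more than one id is a group of duplicates
dups = [ids for ids in dct.values() if len(ids) > 1]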
This is nearly the algorithm for a hash join.

An SQL/Python variant would take the duplicate column and use it to store which message each record is thought to be a duplicate of; for that, the emails table would need at least id, checksum, and duplicate columns.
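A rough sketch of that variant (illustrative only; it uses an in-memory sqlite3 table shaped as above rather than the asker's actual MySQL/Web2py setup):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE emails (
    id        INTEGER PRIMARY KEY,
    checksum  TEXT NOT NULL,
    duplicate INTEGER  -- id of the message this one duplicates, NULL otherwise
)""")

first_seen = {}  # checksum -> id of the first message carrying it
rows = conn.execute("SELECT id, checksum FROM emails").fetchall()
for msg_id, checksum in rows:
    if checksum in first_seen:
        # store which message this one is thought to be a duplicate of
        conn.execute("UPDATE emails SET duplicate = ? WHERE id = ?",
                     (first_seen[checksum], msg_id))
    else:
        first_seen[checksum] = msg_id
conn.commit()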
What you're doing wrong is scanning the whole list again for every single entry, which is O(n²): with 50k records that is roughly 2.5 billion comparisons, plus a database round-trip for every hit.
What you should do is build an index over the checksums: make a dict that maps checksum -> entry. When you insert the entries, check whether the checksum already exists; if it does, the entry is a duplicate. Or simply use your database; databases love indexing.
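A rough sketch of that index, checking as each entry is inserted (n_emails as in the question):

index = {}        # checksum -> first entry seen with that checksum
duplicates = []   # (duplicate entry, original entry) pairs

for entry in n_emails:          # entry is [db_id, checksum]
    checksum = entry[1]
    if checksum in index:
        duplicates.append((entry, index[checksum]))
    else:
        index[checksum] = entry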
You'd be better off looking for duplicates with SQL. For example, see this page describing how to find duplicates.
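The usual SQL pattern is a GROUP BY on the checksum column, for example (illustrative only, run here through Web2py DAL's executesql and assuming the email_data table has a checksum column):

# checksums that occur more than once, with the ids that share them (MySQL syntax)
rows = casesdb.executesql("""
    SELECT checksum, GROUP_CONCAT(id)
    FROM email_data
    GROUP BY checksum
    HAVING COUNT(*) > 1
""")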
Pulling all of those results into Python and processing them is never going to be very fast, but if you must, your best bet is to have a dictionary of checksums to IDs:
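For example, a sketch along those lines (using collections.defaultdict; n_emails and casesdb follow the names from the question, and belongs is used for a single batch update):

from collections import defaultdict

ids_by_checksum = defaultdict(list)
for db_id, checksum in n_emails:
    ids_by_checksum[checksum].append(db_id)

# ids that share their checksum with at least one other record
dup_ids = [i for ids in ids_by_checksum.values() if len(ids) > 1 for i in ids]

# mark them all in one query instead of one update per record
if dup_ids:
    casesdb(casesdb.email_data.id.belongs(dup_ids)).update(duplicated=True)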
Finally, thanks to all the answers, I found that dict mapping is insanely fast, way faster than SQL queries.
Here is my SQL query test (it will look awkward, but that is the syntax of Web2py DAL queries).
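A sketch of that kind of test, assuming the casesdb connection and email_data table from the question plus a checksum field, with one DAL query per record (which is what makes it slow):

import time

start = time.time()
for m in n_emails:
    # look for any other row carrying the same checksum
    matches = casesdb((casesdb.email_data.checksum == m[1]) &
                      (casesdb.email_data.id != m[0])).select(casesdb.email_data.id)
    if matches:
        casesdb(casesdb.email_data.id == m[0]).update(duplicated=True)
print("SQL query test took %.1f secs" % (time.time() - start))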
I tested both approaches on 3,500 records, and dict mapping alone against over 250,000 records.
Results in: about 28 secs.
Here is the dict-based indexing map, based on Dan D's answer:
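A sketch of that dict-based index map (n_emails as above; the index is built in plain Python and only the final marking step touches the database):

import time

start = time.time()

# checksum -> list of ids carrying it
index = {}
for db_id, checksum in n_emails:
    index.setdefault(checksum, []).append(db_id)

# every id whose checksum is shared by at least one other record
dup_ids = [i for ids in index.values() if len(ids) > 1 for i in ids]

print("found %d duplicated records in %.3f secs" % (len(dup_ids), time.time() - start))

# mark the duplicates in the table afterwards
for db_id in dup_ids:
    casesdb(casesdb.email_data.id == db_id).update(duplicated=True)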
Results in: only what? 30 ms!
And let's see how it does de-duping 250k records: less than a minute!!
Thanks for all the answers; I would like to accept everyone who pointed me the right way, but Dan D gave me the most detailed answer! Thanks, Dan!