Efficient processing of an email list in Python
I have a very long list of emails that I would like to process to:
- separate good emails from bad emails, and
- remove duplicates but keep all the non-duplicates in the same order.
This is what I have so far:
    from django.core.validators import email_re

    email_list = ["[email protected]", "invalid_email", ...]
    email_set = set()    # emails seen so far
    bad_emails = []
    good_emails = []
    dups = False         # becomes True once any duplicate is found
    for email in email_list:
        if email in email_set:
            dups = True
            continue
        email_set.add(email)
        if email_re.match(email):
            good_emails.append(email)
        else:
            bad_emails.append(email)
I would like this chunk of code to be as fast as possible and, less importantly, to use as little memory as possible. Is there a way to improve this in Python? Maybe using list comprehensions or iterators?
EDIT: Sorry! I forgot to mention that this is Python 2.5, since this is for GAE.
email_re is from django.core.validators.
Look at "Does Python have an ordered set?" and select an implementation you like.
So just use that: it is probably the fastest and simplest solution you can achieve.
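For illustration only, here is a minimal sketch of the ordered-set idea. It assumes collections.OrderedDict is available (Python 2.7+, so not stock Python 2.5 without a backport) and uses its keys as a stand-in ordered set. EMAIL_RE and split_emails are hypothetical names; the pattern is a simplified substitute for django.core.validators.email_re so that the snippet runs without Django.

```python
import re
from collections import OrderedDict  # assumption: Python 2.7+ or a backport

# Hypothetical, simplified stand-in for django.core.validators.email_re.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def split_emails(email_list):
    """Return (good, bad) lists with duplicates dropped, first-seen order kept."""
    # The keys of an OrderedDict behave like an ordered set:
    # fromkeys() keeps only the first occurrence of each email.
    unique = OrderedDict.fromkeys(email_list)
    good, bad = [], []
    for email in unique:
        if EMAIL_RE.match(email):
            good.append(email)
        else:
            bad.append(email)
    return good, bad

sample = ["a@example.com", "invalid_email", "a@example.com", "b@example.com"]
good_sample, bad_sample = split_emails(sample)
```

Whether this beats the plain set-plus-lists loop from the question would need to be measured; it mainly trades the explicit membership test for a library call.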
I can't think of any way to speed up what you have. It's fast to use a set to keep track of things, and it's fast to use a list to store a list.
I like the OrderedSet solution, but I doubt a Python implementation of OrderedSet would be faster than what you wrote.
You could use an OrderedDict to solve this problem. But that was added for Python 2.7. You could use a recipe (like: http://code.activestate.com/recipes/576693/) to add OrderedDict, but again I don't think it would be any faster than what you have.
I'm trying to think of a Python module that is implemented in C to solve this problem. I think that's the only hope of beating your code. But I haven't thought of anything.
If you can get rid of the dups flag, it will be faster simply by running less Python code.
Interesting question. Good luck.
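As a rough sketch of the "get rid of the dups flag" suggestion: the flag can be derived once after the loop by comparing lengths, instead of being assigned inside it. This assumes the input is a list (so len() works). The names process and EMAIL_RE are hypothetical, and the pattern is a simplified substitute for django.core.validators.email_re. Binding methods to local names is a common micro-optimization that I have not timed here.

```python
import re

# Hypothetical, simplified stand-in for django.core.validators.email_re.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def process(email_list):
    """Return (good, bad, dups) for a list of emails, keeping first-seen order."""
    seen = set()
    good, bad = [], []
    # Bind methods to local names once to avoid attribute lookups in the loop.
    seen_add = seen.add
    is_valid = EMAIL_RE.match
    add_good = good.append
    add_bad = bad.append
    for email in email_list:
        if email in seen:
            continue
        seen_add(email)
        if is_valid(email):
            add_good(email)
        else:
            add_bad(email)
    # Duplicates existed if fewer distinct emails were seen than were supplied.
    dups = len(seen) != len(email_list)
    return good, bad, dups

result_with_dups = process(["a@example.com", "bad", "a@example.com"])
result_without_dups = process(["a@example.com", "bad"])
```

If the flag is not needed at all, the last assignment can simply be dropped.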