How to get the difference between two lists based on a substring of each string in the separate lists
I have two long lists, one from a log file that contains lines formatted like
201001050843 blah blah blah <[email protected]> blah blah
and a second file in CSV format. I need to generate a list of all the entries in file2 whose email address does not appear in the log file, while maintaining the CSV format.
Example
Log file contains:
201001050843 blah blah blah <[email protected]> blah blah
201001050843 blah blah blah <[email protected]> blah blah
File2 contains:
156456,bob,sagget,[email protected],4564456
156464,bob,otherguy,[email protected],45644562
The output should be:
156464,bob,otherguy,[email protected],45644562
Currently I grab the emails from the log and load them into another list with:
sent_emails = []
for line in sent:
    try:
        temp1 = line.index('<')
        temp2 = line.index('>')
        sent_emails.append(line[temp1+1:temp2])
    except ValueError:
        pass
And then compare to file2 with either:
lista = mail_lista.readlines()
for line in lista:
    temp = line.split()
    for thing in temp:
        try:
            if thing.index('@'):
                if thing in sent_emails:
                    lista.remove(temp)
        except ValueError:
            pass
newa.writelines(lista)
or:
for line in mail_listb:
    temp = line.split()
    for thing in temp:
        try:
            if thing.index('@'):
                if thing not in sent_emails:
                    newb.write(line)
        except ValueError:
            pass
However, both return all of file2!
Thanks for any help you can give.
EDIT: Thanks for the recommendations for sets, it made a larger speed difference than I would have thought possible. Way to go hash tables! I will definitely be using sets more often from now on.
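For anyone curious about the speed difference mentioned in the edit, here is a rough way to measure it. This is only a sketch: the addresses are made up, the sizes are arbitrary, and count_hits is a hypothetical helper, so the actual timings will depend on your data and machine.

```python
import timeit

# Hypothetical data: 5,000 made-up addresses, 500 lookups (half present, half not).
addresses = ["user%d@example.com" % i for i in range(5000)]
lookups = ["user%d@example.com" % i for i in range(4500, 5500, 2)]

as_list = list(addresses)
as_set = set(addresses)

def count_hits(container):
    """Count how many of the lookups are present in the container."""
    return sum(1 for address in lookups if address in container)

# A list is scanned item by item on every "in" test; a set uses a hash lookup.
list_seconds = timeit.timeit(lambda: count_hits(as_list), number=3)
set_seconds = timeit.timeit(lambda: count_hits(as_set), number=3)
print("list: %.4fs  set: %.4fs" % (list_seconds, set_seconds))
```

Both containers give the same answer; only the cost of each membership test differs, and the gap grows with the number of addresses.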
Comments (3)
You could create the set of emails as you do and then:
This only works if the email in the CSV file is always at position 4.
fileinput enables in-place editing.
And use a set for the emails instead of a list as Aaron said, not only because of speed but also to eliminate duplicates.
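A minimal sketch of what this could look like, assuming the email is always the fourth comma-separated field and that sent_emails is already a set. The function name drop_sent_rows_in_place and its email_column parameter are made up for illustration; only fileinput.input with inplace=True comes from the standard library.

```python
import fileinput

def drop_sent_rows_in_place(csv_path, sent_emails, email_column=3):
    """Rewrite csv_path in place, keeping only rows whose email is not in sent_emails.

    email_column is the zero-based index of the email field (position 4 -> index 3).
    """
    # With inplace=True, anything printed inside the loop is written back to the file.
    with fileinput.input(files=[csv_path], inplace=True) as rows:
        for row in rows:
            fields = row.rstrip('\n').split(',')
            # Rows too short to have an email column are kept unchanged.
            if len(fields) <= email_column or fields[email_column] not in sent_emails:
                print(row, end='')
```

Because the file is overwritten, it is probably worth trying this on a copy first. fileinput.input also accepts a backup argument (for example backup='.bak') if you want the original kept alongside.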
line.split() splits at whitespace. Use line.split(',') instead.
Also: does the order of the lines matter? If not, then you should really use a set() instead of a list. That will make the code much faster.
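To illustrate the point about split(): a CSV row with no spaces comes back from split() as a single item, so it can never equal an email address and every row looks "unsent". A small sketch, using a made-up address and a hypothetical helper named unsent_rows, that keeps the original line order:

```python
def unsent_rows(csv_lines, sent_emails):
    """Return the CSV lines that contain no address from sent_emails, in order."""
    sent = set(sent_emails)  # hash-based membership tests instead of list scans
    kept = []
    for line in csv_lines:
        fields = line.strip().split(',')  # split on commas, not on whitespace
        if not any(field in sent for field in fields):
            kept.append(line)
    return kept

# Hypothetical row; the real addresses in the post were redacted.
row = "156456,bob,sagget,bob.sagget@example.com,4564456\n"
print(row.split())     # a single item holding the whole row
print(row.split(','))  # five fields, the fourth being the address
```

Collecting the kept lines in a new list also avoids removing items from a list while iterating over it.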
Here's another way, with a minimal check on the email address's position.
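A minimal sketch of what such an approach could look like (not the answerer's original code). It assumes the email is whichever CSV field contains an '@', rather than a fixed column. The sample addresses and the helper names sent_addresses, email_field and not_yet_sent are invented for illustration.

```python
import re

# Hypothetical sample data; the addresses in the original post were redacted.
LOG_LINES = [
    "201001050843 blah blah blah <bob.sagget@example.com> blah blah",
    "201001050843 blah blah blah <someone.else@example.com> blah blah",
]
CSV_LINES = [
    "156456,bob,sagget,bob.sagget@example.com,4564456",
    "156464,bob,otherguy,bob.otherguy@example.com,45644562",
]

def sent_addresses(log_lines):
    """Set of addresses found between angle brackets in the log lines."""
    found = set()
    for line in log_lines:
        found.update(re.findall(r"<([^<>\s]+@[^<>\s]+)>", line))
    return found

def email_field(csv_line):
    """First comma-separated field containing '@', or None if there is none."""
    for field in csv_line.strip().split(","):
        if "@" in field:
            return field
    return None

def not_yet_sent(csv_lines, log_lines):
    """CSV lines whose email field does not appear in the log (order preserved)."""
    sent = sent_addresses(log_lines)
    # Lines without any email field are kept, since None is never in the set.
    return [line for line in csv_lines if email_field(line) not in sent]

for line in not_yet_sent(CSV_LINES, LOG_LINES):
    print(line)  # prints: 156464,bob,otherguy,bob.otherguy@example.com,45644562
```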