How to get the difference between two lists based on a substring of each string in a separate list

Posted 2024-08-17 01:33:09

I have two long lists, one from a log file that contains lines formatted like

201001050843 blah blah blah <[email protected]> blah blah

and a second file in CSV format. I need to generate a list of all the entries in file2 whose email address does not appear in the log file, while preserving the CSV format.

Example
Log file contains:

201001050843 blah blah blah <[email protected]> blah blah
201001050843 blah blah blah <[email protected]> blah blah

File2 contains:

156456,bob,sagget,[email protected],4564456
156464,bob,otherguy,[email protected],45644562

The output should be:

156464,bob,otherguy,[email protected],45644562

Currently I grab the emails from the log and load them into another list with:

sent_emails =[]
for line in sent:
    try:
        temp1= line.index('<')
        temp2 = line.index('>')
        sent_emails.append(line[temp1+1:temp2])
    except ValueError:
        pass

And then compare to file2 with either:

lista = mail_lista.readlines()
for line in lista:
    temp = line.split()
    for thing in temp:
        try:
            if thing.index('@'):
                if thing in sent_emails:
                    lista.remove(temp)
        except ValueError:
            pass
newa.writelines(lista)

or:

for line in mail_listb:
    temp = line.split()
    for thing in temp:
        try:
            if thing.index('@'):
                if thing not in sent_emails:
                    newb.write(line)
        except ValueError:
            pass

However, both return all of file2!

Thanks for any help you can give.

EDIT: Thanks for the recommendations to use sets; they made a larger speed difference than I would have thought possible. Way to go, hash tables! I will definitely be using sets more often from now on.

Comments (3)

昇り龍 2024-08-24 01:33:09

You could build the set of emails the same way you do now, and then:

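import fileinput

# emails is a set of addresses already extracted from the log
for line in fileinput.input("csvfile.csv", inplace=True):
    parts = line.split(',')
    if parts[3] not in emails:
        # inplace=True redirects stdout into the file;
        # end="" avoids doubling the newline already in line
        print(line, end="")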

This only works if the email in the CSV file is always the fourth field (index 3).

fileinput enables in-place editing.

Also, use a set for the emails instead of a list, as Aaron said: not only for speed but also to eliminate duplicates.
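
For illustration, here is a minimal sketch of the in-place approach in Python 3, wrapped in a function so it can be tried on a scratch file. The function name `filter_csv_in_place`, the `email_column` parameter, and the decision to keep rows that are too short are assumptions made for this example, not part of the original answer.

```python
import fileinput


def filter_csv_in_place(path, sent_emails, email_column=3):
    """Rewrite the file at `path`, keeping rows whose email is not in `sent_emails`.

    `sent_emails` should be a set so each membership test is a hash lookup.
    """
    with fileinput.input(files=[path], inplace=True) as rows:
        for line in rows:
            fields = line.rstrip("\n").split(",")
            # Assumption: rows too short to have an email column are kept unchanged.
            if len(fields) <= email_column or fields[email_column] not in sent_emails:
                # With inplace=True, stdout is redirected into the file being read.
                print(line, end="")
```

Calling `filter_csv_in_place("csvfile.csv", emails)` overwrites the file, so it is worth trying on a copy first, or passing `backup=".bak"` to `fileinput.input` to keep the original.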

浅语花开 2024-08-24 01:33:09

line.split() splits at whitespace. Use line.split(',') instead.

Also: Does the order of the lines matter? If not, then you should really use a set() instead of a list. That will make the code much faster.
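
To see why both loops in the question return every line, it may help to compare the two calls on one sample row. The addresses below are made-up placeholders (the real ones are hidden in the post), and the variable names are only for this sketch.

```python
# A sample CSV row with a placeholder address; note there is no space in it.
line = "156464,bob,otherguy,unsent@example.com,45644562\n"

# split() with no argument splits on whitespace, so the whole row is one token.
whitespace_tokens = line.split()
# -> ['156464,bob,otherguy,unsent@example.com,45644562']

# Splitting on commas yields the individual fields.
comma_fields = line.strip().split(",")
# -> ['156464', 'bob', 'otherguy', 'unsent@example.com', '45644562']

# A set gives fast membership tests and drops duplicate addresses.
sent_emails = {"sent@example.com"}

print(whitespace_tokens[0] in sent_emails)  # False: a whole row never equals an address
print(comma_fields[3] in sent_emails)       # False here only because this address was not sent
```

With `split()`, the only token containing `@` is the entire row, which is never in `sent_emails`, so the first loop never removes anything and the second writes every line.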

噩梦成真你也成魔 2024-08-24 01:33:09

Here's another way, with a minimal check on the email address's position.

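import fileinput

# Collect the addresses between < and > in the log; a set makes lookups fast
emails = set()
with open("file1") as log:
    for line in log:
        start = line.find("<")
        end = line.find(">")
        if start != -1 and end != -1:
            emails.add(line[start+1:end])

# inplace=True redirects print() into file2, so only the printed rows remain
for line in fileinput.FileInput("file2", inplace=True):
    p = line.split(",")
    for item in p:
        if "@" in item and item not in emails:
            print(line.strip())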

Output:

$ ./python.py
156464,bob,otherguy,[email protected],45644562
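
As a variation on the same idea, the sketch below uses the standard `re` and `csv` modules and does not modify any file. The function names `extract_sent_emails` and `unsent_rows` are hypothetical, and the rule "keep a row if none of its fields is an address from the log" is an assumption that differs slightly from the loop above (rows without any email field are kept).

```python
import csv
import re

# Matches the text between a pair of angle brackets, e.g. <user@example.com>.
EMAIL_IN_BRACKETS = re.compile(r"<([^<>]+)>")


def extract_sent_emails(log_lines):
    """Collect every <...> token from the log lines into a set."""
    sent = set()
    for line in log_lines:
        sent.update(EMAIL_IN_BRACKETS.findall(line))
    return sent


def unsent_rows(csv_lines, sent_emails):
    """Yield parsed CSV rows in which no field is an address from the log."""
    for row in csv.reader(csv_lines):
        if not any(field in sent_emails for field in row):
            yield row
```

Both functions accept any iterable of lines, so open file objects can be passed directly, and the resulting rows can be written back out with `csv.writer(...).writerows(...)` to keep the CSV format.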