Comparing two CSV files and searching for similar items
So I've got two CSV files that I'm trying to compare and get the results of the similar items.
The first file, hosts.csv is shown below:
Path Filename Size Signature
C:\ a.txt 14kb 012345
D:\ b.txt 99kb 678910
C:\ c.txt 44kb 111213
The second file, masterlist.csv is shown below:
Filename Signature
b.txt 678910
x.txt 111213
b.txt 777777
c.txt 999999
As you can see, the rows do not match up, and masterlist.csv is always larger than hosts.csv. The only portion that I'd like to search on is the Signature column. I know the comparison would look something like:
hosts[3] == masterlist[1]
I am looking for a solution that will give me something like the following (basically the hosts.csv file with a new RESULTS column):
Path Filename Size Signature RESULTS
C:\ a.txt 14kb 012345 NOT FOUND in masterlist
D:\ b.txt 99kb 678910 FOUND in masterlist (row 1)
C:\ c.txt 44kb 111213 FOUND in masterlist (row 2)
I've searched the posts and found something similar to this, but I don't quite understand it as I'm still learning Python.
Edit: Using Python 2.6
The answer by srgerg is terribly inefficient, as it operates in quadratic time; here is a linear-time solution instead, using Python 2.6-compatible syntax:
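A minimal sketch of that dictionary-based approach follows. It assumes comma-delimited input, a header row in both files, and the signature in column 3 of hosts.csv and column 1 of masterlist.csv. The function name `annotate_hosts` and the in-memory sample data are illustrative, and keeping the first row on which a signature appears is a choice made here. The function sticks to syntax that should also be valid on Python 2.6; the demonstration at the bottom uses `io.StringIO` stand-ins and is meant for Python 3.

```python
import csv
import io


def annotate_hosts(hosts_file, master_file, out_file):
    """Copy the hosts rows to out_file with an extra RESULTS column."""
    # First pass: map each masterlist signature (column 1) to its row index.
    # The header is row 0, so the first data row is reported as row 1.
    master_indices = {}
    for i, row in enumerate(csv.reader(master_file)):
        if row:  # skip blank lines
            # Assumption: report the first row on which a signature occurs.
            master_indices.setdefault(row[1], i)

    reader = csv.reader(hosts_file)
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(next(reader) + ["RESULTS"])  # copy the header row

    # Second pass: one constant-time dictionary lookup per hosts row.
    for row in reader:
        index = master_indices.get(row[3])
        if index is None:
            message = "NOT FOUND in masterlist"
        else:
            message = "FOUND in masterlist (row %d)" % index
        writer.writerow(row + [message])


# In-memory stand-ins for hosts.csv and masterlist.csv from the question.
hosts = io.StringIO(
    "Path,Filename,Size,Signature\n"
    "C:\\,a.txt,14kb,012345\n"
    "D:\\,b.txt,99kb,678910\n"
    "C:\\,c.txt,44kb,111213\n"
)
master = io.StringIO(
    "Filename,Signature\n"
    "b.txt,678910\n"
    "x.txt,111213\n"
    "b.txt,777777\n"
    "c.txt,999999\n"
)
results = io.StringIO()
annotate_hosts(hosts, master, results)
print(results.getvalue())
```

With real files, the three arguments would be open file objects for hosts.csv, masterlist.csv, and the output file.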
This produces a dictionary that first maps each signature in `masterlist.csv` to a line number. Lookups in a dictionary take constant time, which makes the second loop over the `hosts.csv` rows independent of the number of rows in `masterlist.csv`. Not to mention that the code is a lot simpler.

For those using Python 3, the above only needs the `open()` calls adjusted to open in text mode (remove the `b` from the file mode), and you want to add `newline=''` so the CSV reader can take control of line separators. You may want to state the encoding explicitly rather than rely on your system default (use `encoding=...`). The `master_indices` mapping can be built with a dictionary comprehension (`{r[1]: i for i, r in enumerate(csv.reader(master))}`).
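The Python 3 adjustments described above might look like the following sketch. The helper name `load_master_indices`, the `utf-8` default, and the throwaway file written for the demonstration are assumptions; the `if r` guard for blank lines is also an addition to the comprehension quoted above.

```python
import csv
import os
import tempfile


def load_master_indices(path, encoding="utf-8"):
    """Map each signature in the file at `path` to its row index."""
    # Text mode ("r", not "rb"); newline="" lets the csv module handle line endings.
    with open(path, "r", newline="", encoding=encoding) as master:
        # A repeated signature keeps the index of its last occurrence.
        return {r[1]: i for i, r in enumerate(csv.reader(master)) if r}


# Write a throwaway masterlist file to demonstrate the call.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "masterlist.csv")
    with open(path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows([
            ["Filename", "Signature"],
            ["b.txt", "678910"],
            ["x.txt", "111213"],
        ])
    master_indices = load_master_indices(path)

print(master_indices)
```

Note that the header row is indexed too (as row 0), which is what makes the first data row come out as row 1.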
Edit: While my solution works correctly, check out Martijn's answer for a more efficient solution.
The documentation for Python's `csv` module is worth reading.
What you're looking for is something like this:
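A minimal sketch of a straightforward nested-scan approach, not necessarily the code this answer originally showed. It assumes the rows have already been parsed into lists, with the header as the first row of each file; the function names `find_row` and `annotate` are illustrative. Every hosts row triggers a scan of the masterlist, which is where the quadratic running time comes from.

```python
def find_row(signature, master_rows):
    """Return the index of the first masterlist row with this signature, or None."""
    for index, row in enumerate(master_rows):
        if row[1] == signature:
            return index
    return None


def annotate(hosts_rows, master_rows):
    """Return the hosts rows with a RESULTS value appended to each."""
    annotated = [hosts_rows[0] + ["RESULTS"]]  # header row
    for row in hosts_rows[1:]:
        # Scans the masterlist again for every hosts row.
        index = find_row(row[3], master_rows)
        if index is None:
            annotated.append(row + ["NOT FOUND in masterlist"])
        else:
            annotated.append(row + ["FOUND in masterlist (row %d)" % index])
    return annotated


hosts_rows = [
    ["Path", "Filename", "Size", "Signature"],
    ["C:\\", "a.txt", "14kb", "012345"],
    ["D:\\", "b.txt", "99kb", "678910"],
    ["C:\\", "c.txt", "44kb", "111213"],
]
master_rows = [
    ["Filename", "Signature"],
    ["b.txt", "678910"],
    ["x.txt", "111213"],
    ["b.txt", "777777"],
    ["c.txt", "999999"],
]
annotated = annotate(hosts_rows, master_rows)
for row in annotated:
    print(row)
```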
Python's `csv` and `collections` modules, specifically `OrderedDict`, are really helpful here. You want to use `OrderedDict` to preserve the order of the keys, etc. You don't have to, but it's useful!
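One way this could look, as a sketch rather than a definitive implementation: it assumes tab-separated files with a header row and a column literally named `Signature`, and the function name `annotate` is illustrative. Note that `OrderedDict` was added in Python 2.7, so Python 2.6 would need a backport.

```python
import csv
import io
from collections import OrderedDict


def annotate(hosts_file, master_file, out_file):
    # Signature -> row number, in the order signatures appear in the masterlist.
    master = OrderedDict()
    for number, row in enumerate(csv.reader(master_file, delimiter="\t")):
        if number > 0 and row:  # skip the header and blank lines
            master.setdefault(row[1], number)

    reader = csv.reader(hosts_file, delimiter="\t")
    header = next(reader)
    writer = csv.writer(out_file, delimiter="\t", lineterminator="\n")
    writer.writerow(header + ["RESULTS"])
    for values in reader:
        # An OrderedDict keeps the columns in header order when written back out.
        record = OrderedDict(zip(header, values))
        number = master.get(record["Signature"])
        if number is None:
            record["RESULTS"] = "NOT FOUND in masterlist"
        else:
            record["RESULTS"] = "FOUND in masterlist (row %d)" % number
        writer.writerow(list(record.values()))


def as_tsv(rows):
    """Join rows of fields into tab-separated text for the demonstration."""
    return "\n".join("\t".join(row) for row in rows) + "\n"


hosts = io.StringIO(as_tsv([
    ["Path", "Filename", "Size", "Signature"],
    ["C:\\", "a.txt", "14kb", "012345"],
    ["D:\\", "b.txt", "99kb", "678910"],
    ["C:\\", "c.txt", "44kb", "111213"],
]))
master = io.StringIO(as_tsv([
    ["Filename", "Signature"],
    ["b.txt", "678910"],
    ["x.txt", "111213"],
    ["b.txt", "777777"],
    ["c.txt", "999999"],
]))
out = io.StringIO()
annotate(hosts, master, out)
print(out.getvalue())
```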
Here's the output using your test CSV files:
Please excuse the misalignment, they are tab separated :)
The `csv` module comes in handy when parsing CSV files. But just for fun, I am simply splitting the input on whitespace to get at the data.

Just parse in the data and build a `dict` for the data in masterlist.csv, with the signature as key and the line number as value. Now, for each row of hosts.csv, we can just query the `dict` and find out whether or not a corresponding entry exists in masterlist.csv and, if so, at which line.
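A minimal sketch of the whitespace-splitting idea, assuming the signature is the last field on every line and that no field contains spaces (a path such as `C:\Program Files` would break it). The function names `index_signatures` and `annotate` are illustrative.

```python
def index_signatures(master_lines):
    """Map each signature to the line number on which it first appears."""
    index = {}
    for line_number, line in enumerate(master_lines):
        fields = line.split()  # split on any run of whitespace
        if line_number > 0 and fields:  # skip the header and blank lines
            index.setdefault(fields[-1], line_number)
    return index


def annotate(hosts_lines, index):
    """Return the hosts lines with a RESULTS field appended to each."""
    annotated = []
    for line_number, line in enumerate(hosts_lines):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if line_number == 0:
            result = "RESULTS"  # extend the header
        elif fields[-1] in index:
            result = "FOUND in masterlist (row %d)" % index[fields[-1]]
        else:
            result = "NOT FOUND in masterlist"
        annotated.append(" ".join(fields + [result]))
    return annotated


master_lines = [
    "Filename Signature",
    "b.txt 678910",
    "x.txt 111213",
    "b.txt 777777",
    "c.txt 999999",
]
hosts_lines = [
    "Path Filename Size Signature",
    r"C:\ a.txt 14kb 012345",
    r"D:\ b.txt 99kb 678910",
    r"C:\ c.txt 44kb 111213",
]
annotated = annotate(hosts_lines, index_signatures(master_lines))
print("\n".join(annotated))
```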
I just fixed a small thing in Martijn Pieters' code to make it work in Python 3. In this code, I am trying to match the first-column elements of file1 (`row[0]`) with the first-column elements of file2 (`row[0]`).
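A sketch of matching on the first column of both files in Python 3. The function name `match_first_column`, its `column1`/`column2` parameters, and the sample rows are hypothetical, chosen only to illustrate the idea.

```python
import csv
import io


def match_first_column(file1, file2, column1=0, column2=0):
    """Pair each file1 row with the index of the matching file2 row, or None."""
    # Index file2 once, keyed on the chosen column (row[0] by default).
    indices = {row[column2]: i for i, row in enumerate(csv.reader(file2)) if row}
    # Look up each file1 row's key in that index.
    return [(row, indices.get(row[column1])) for row in csv.reader(file1) if row]


# Hypothetical sample data without header rows.
file1 = io.StringIO("a.txt,1\nb.txt,2\n")
file2 = io.StringIO("b.txt,9\nc.txt,8\n")
matches = match_first_column(file1, file2)
for row, index in matches:
    print(row, index)
```

When reading from disk in Python 3, the files would be opened in text mode with `newline=''`.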