Script to compare a string in two different files
I am brand new to Stack Overflow and to scripting. I am looking for help getting started on a script, not necessarily for someone to write it for me.
Here's what I have:
File1.csv - contains some information, I am only interested in MAC addresses.
File2.csv - has some different information, but also contains MAC addresses.
I need a script that parses the MAC addresses from file1.csv and logs a report if any MAC address shows up in file2.csv.
The questions:
Any tips on the language I use, preferably perl, python or bash?
Can anyone suggest some structure for the logic needed (even if just in pseudo-code)?
Update
Using @Adam Wagner's approach, I am really close!
import csv

# Need to strip out NUL values from .csv file to make python happy
class FilteredFile(file):
    def next(self):
        return file.next(self).replace('\x00','').replace('\xff\xfe','')

reader = csv.reader(FilteredFile('wifi_clients.csv', 'rb'), delimiter=',', quotechar='|')
s1 = set(rec[0] for rec in reader)
inventory = csv.reader(FilteredFile('inventory.csv','rb'),delimiter=',')
s2 = set(rec[6] for rec in inventory)
shared_items = s1.intersection(s2)
print shared_items
This always outputs the following (even if I doctor the .csv files to have matching MAC addresses):
set([])
Contents of the csv files
wifi_clients.csv
macNames, First time seen, Last time seen,Power, # packets, BSSID, Probed ESSIDs
inventory.csv
Name,Manufacturer,Device Type,Model,Serial Number,IP Address,MAC Address,...
Comments (4)
Here's the approach I'd take:
Iterate over each csv file (python has a handy csv module for accomplishing this), capturing the mac-address and placing it in a set (one per file). Once again, python has a great builtin set type. Here's a good example of using the csv module and, of course, the docs.
Next, you can get the intersection of set1 (file1) and set2 (file2). This will show you the mac-addresses that exist in both files.
Example (in python):
Which outputs:
Logging these shared items could be done with anything from printing (then redirecting output to a file), to using the logging module, to saving directly to a file.
I'm not sure how in-depth of an answer you were looking for, but this should get you started.
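A minimal sketch of the intersection and logging steps described above, written for Python 3. The MAC addresses and the names set1 and set2 are invented for illustration; in practice each set would be filled from one of the csv files.

```python
import logging

# Made-up MAC addresses standing in for the contents of file1 and file2.
set1 = {"00:11:22:33:44:55", "66:77:88:99:aa:bb", "cc:dd:ee:ff:00:11"}
set2 = {"66:77:88:99:aa:bb", "12:34:56:78:9a:bc"}

# Addresses that appear in both sets.
shared_items = set1.intersection(set2)

# Report each shared address through the logging module.
logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
for mac in sorted(shared_items):
    logging.info("MAC address found in both files: %s", mac)
```

Passing filename="report.log" (a hypothetical file name) to logging.basicConfig would send the same messages to a file instead of the console.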
Update: CSV/Set usage example
Assuming you have a file "foo.csv", that looks something like this:
The simplest way to build the set would be something like this:
Obviously, you'd need something like this for each file, so you may want to put this in a function to make life easier.
Finally, if you want to go the less-verbose-but-cooler-python-way, you could also build the set like this:
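As a rough illustration of both styles in Python 3, the sketch below builds the set once with an explicit loop and once with a set comprehension. The contents of "foo.csv" are an assumption (a name column followed by a MAC column, no header row), and an in-memory string stands in for the real file.

```python
import csv
import io

# Hypothetical contents of "foo.csv": name, MAC address (no header row).
FOO_CSV = (
    "alpha,00:11:22:33:44:55\n"
    "beta,66:77:88:99:aa:bb\n"
    "gamma,00:11:22:33:44:55\n"
)

def macs_with_loop(csv_file, column):
    """Build the set row by row with an explicit loop."""
    macs = set()
    for row in csv.reader(csv_file):
        macs.add(row[column])
    return macs

def macs_with_comprehension(csv_file, column):
    """Build the same set in a single expression."""
    return {row[column] for row in csv.reader(csv_file)}

loop_result = macs_with_loop(io.StringIO(FOO_CSV), 1)
comprehension_result = macs_with_comprehension(io.StringIO(FOO_CSV), 1)
print(sorted(loop_result))
```

With a real file, open("foo.csv", newline="") would replace io.StringIO(FOO_CSV). The duplicate address in the sample collapses to a single entry because sets keep only unique values.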
I strongly recommend python to do this.
Because you didn't give the structure of the csv file, I can only show a framework:
If the data set is large, use a dict or set instead of the list and the intersect operation. But since these are MAC addresses, I guess your dataset is not that large, so keeping the script easy to read is the most important thing.
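One way such a framework might look in Python 3 is sketched below. The function name common_macs and the sample addresses are assumptions, since the csv layout is unknown. The same function works with a list or a set as the second argument; only the cost of each membership test changes.

```python
def common_macs(first, second):
    """Return the entries of `first` that also appear in `second`, in order."""
    return [mac for mac in first if mac in second]

# Made-up addresses standing in for the MAC columns of the two files.
file1_macs = ["00:11:22:33:44:55", "66:77:88:99:aa:bb", "cc:dd:ee:ff:00:11"]
file2_macs = ["cc:dd:ee:ff:00:11", "12:34:56:78:9a:bc", "00:11:22:33:44:55"]

# Readable version: `in` scans the whole list for every lookup.
matches_list = common_macs(file1_macs, file2_macs)

# Version for larger data: converting to a set makes each lookup
# a hash lookup instead of a scan.
matches_set = common_macs(file1_macs, set(file2_macs))

print(matches_list)
```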
Awk is perfect for this.
Then you'd run that on the first file:
awk -f logging_script.awk mac_list.txt
(this code is untested and I'm not the greatest awk hacker, but it should give the general idea)
For the purposes of this example, generate 2 files that look like yours.
File1:
File2 (contains random entries of "mac addresses" numbered from 1-200)
The simplest approach would be to use the join command and do a join on the appropriate field. This approach has the advantage that fields from both files are available in the output.
Based on the example files above, the command would look like this:
join needs the input to be sorted by the field you are matching on, which is why the sort is there (-k tells it which column to use).
The command above matches rows from file1.csv with rows from file2.csv if column 4 in the first file is equal to column 1 in the second file.
If you only need specific fields, you can specify the output format to the join command:
This would print only the mac address and the machine field from the first file.
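For comparison, the same kind of join can be sketched in Python 3. The rows below are invented: the only details taken from the description above are that the MAC address is the fourth field of the first file and the first field of the second file. Treating the first column of file1 as the machine field is an assumption, and join_on is a hypothetical helper name.

```python
def join_on(rows1, key1, rows2, key2):
    """Pair every row of rows1 with the rows of rows2 sharing the same key."""
    index = {}
    for row in rows2:
        index.setdefault(row[key2], []).append(row)
    joined = []
    for row in rows1:
        for match in index.get(row[key1], []):
            joined.append((row, match))
    return joined

# Invented sample rows: file1 has the MAC in its fourth field (index 3),
# file2 has it in its first field (index 0).
file1_rows = [
    ["host-a", "lab", "10.0.0.5", "00:11:22:33:44:55"],
    ["host-b", "office", "10.0.0.9", "66:77:88:99:aa:bb"],
]
file2_rows = [
    ["66:77:88:99:aa:bb", "ap-3"],
    ["cc:dd:ee:ff:00:11", "ap-1"],
]

joined = join_on(file1_rows, 3, file2_rows, 0)

# Keep only the MAC address and the machine field from the first file.
selected = [(left[3], left[0]) for left, _right in joined]
print(selected)
```

Because the lookup goes through a dictionary, this version does not need its input sorted the way join does.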
If you only need a list of matching mac addresses, you can use uniq or sort -u. Since the join output will be sorted by mac, uniq is faster. But if you need a unique list of another field, sort -u is better.
If you only need the mac addresses that match, grep can accept patterns from a file, and you can use cut to extract only the fourth field.
The above would list all the lines in file2.csv that contain a mac address from file1
Note that I'm using fgrep, which doesn't do pattern matching. Also, if file1 is big, this may be slower than the first approach. It also assumes that the mac is present only in field 1 of file2 and that the other fields don't contain mac addresses.
If you only need the mac, you can either use the -o option of fgrep (though some grep variants don't have it), or pipe the output through cut and then sort -u.
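The fixed-string matching idea can be sketched in Python 3 as follows. The sample lines and the helper name fixed_string_matches are assumptions. Like fgrep, the check is a plain substring test, so it carries the same caveat: a MAC appearing in any field of a line counts as a match.

```python
def fixed_string_matches(lines, patterns):
    """Return the lines that contain at least one pattern as a plain substring."""
    return [line for line in lines if any(p in line for p in patterns)]

# Made-up MAC addresses, as if cut from the fourth field of file1.
patterns = {"00:11:22:33:44:55", "66:77:88:99:aa:bb"}

# Made-up lines of file2, with the MAC in the first field.
file2_lines = [
    "66:77:88:99:aa:bb,ap-3",
    "cc:dd:ee:ff:00:11,ap-1",
    "00:11:22:33:44:55,ap-3",
    "66:77:88:99:aa:bb,ap-7",
]

matching_lines = fixed_string_matches(file2_lines, patterns)

# Equivalent of piping through cut and sort -u: first field only, unique, sorted.
unique_macs = sorted({line.split(",")[0] for line in matching_lines})
print(unique_macs)
```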
This would be the bash way.
Python and awk hints have been shown above, I will take a stab at perl:
First you create a dictionary containing all the mac addresses from the first file:
Then you read the second file line by line, and check if the mac exists in the above dictionary:
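The same two steps, sketched in Python 3 rather than perl. The file contents, the column positions, and the normalize helper are all assumptions made for illustration. The normalization step is there because the same address written with a different case or separator in each file would otherwise never compare equal.

```python
import csv
import io

def normalize(mac):
    """Bring a MAC address to one spelling: lowercase, colon-separated."""
    return mac.strip().lower().replace("-", ":")

# In-memory stand-ins for the two files (invented rows).
file1 = io.StringIO(
    "host-a,lab,10.0.0.5,00:11:22:33:44:55\n"
    "host-b,office,10.0.0.9,66-77-88-99-AA-BB\n"
)
file2 = io.StringIO(
    "66:77:88:99:aa:bb,ap-3\n"
    "cc:dd:ee:ff:00:11,ap-1\n"
)

# Step 1: dictionary of every MAC in the first file, mapped to its row.
seen = {}
for row in csv.reader(file1):
    seen[normalize(row[3])] = row

# Step 2: read the second file line by line and look each MAC up.
report = []
for row in csv.reader(file2):
    mac = normalize(row[0])
    if mac in seen:
        report.append((mac, seen[mac][0], row[1]))

print(report)
```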