Script to compare a string in two different files

Posted on 2024-12-20 01:33:24 · 1287 words · 0 views · 0 comments

I am brand new to Stack Overflow and to scripting. I am looking for help getting started on a script, not necessarily for someone to write it.

Here's what I have:
File1.csv - contains some information, I am only interested in MAC addresses.
File2.csv - has some different information, but also contains MAC addresses.

I need a script that parses the MAC addresses from file1.csv and logs a report if any MAC address shows up in file2.csv.

The questions:

  1. Any tips on the language I use, preferably perl, python or bash?

  2. Can anyone suggest some structure for the logic needed (even if just in pseudo-code)?

Update

Using @Adam Wagner's approach, I am really close!

import csv
#Need to strip out NUL values from .csv file to make python happy
class FilteredFile(file):
        def next(self):
                return file.next(self).replace('\x00','').replace('\xff\xfe','')

reader = csv.reader(FilteredFile('wifi_clients.csv', 'rb'), delimiter=',', quotechar='|')
s1 = set(rec[0] for rec in reader)

inventory = csv.reader(FilteredFile('inventory.csv','rb'),delimiter=',')
s2 = set(rec[6] for rec in inventory)

shared_items = s1.intersection(s2)
print shared_items

This always outputs the following (even if I doctor the .csv files to have matching MAC addresses):

set([])

Contents of the csv files

wifi_clients.csv

macNames, First time seen, Last time seen,Power, # packets, BSSID, Probed ESSIDs

inventory.csv

Name,Manufacturer,Device Type,Model,Serial Number,IP Address,MAC Address,...

Comments (4)

与风相奔跑 2024-12-27 01:33:24

Here's the approach I'd take:

  1. Iterate over each csv file (python has a handy csv module for accomplishing this), capturing the mac-address and placing it in a set (one per file). And once again, python has a great builtin set type. Here's a good example of using the csv module and, of course, the docs.

  2. Next, you can get the intersection of set1 (file1) and set2 (file2). This will show you mac-addresses that exist in both files one and two.

Example (in python):

s1 = set([1,2,3])  # You can add things incrementally with "s1.add(value)"
s2 = set([2,3,4])

shared_items = s1.intersection(s2)
print shared_items

Which outputs:

set([2, 3])

Logging these shared items could be done with anything from printing (then redirecting output to a file), to using the logging module, to saving directly to a file.
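For the logging-module option, a minimal sketch in Python 3 might look like the following (the answer's own examples are Python 2). The function name `log_shared`, the logger name `mac_report`, the default file name `shared_macs.log` and the message wording are all made up for illustration.

```python
import logging


def log_shared(shared_items, log_path="shared_macs.log"):
    """Write one log line per shared item and return how many were written.

    All names here are hypothetical; they are not from the original post.
    """
    logger = logging.getLogger("mac_report")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(log_path)
    handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    logger.addHandler(handler)
    try:
        # Sorting only makes the report stable between runs.
        for item in sorted(shared_items):
            logger.info("MAC address present in both files: %s", item)
    finally:
        # Detach and close the handler so repeated calls do not duplicate lines.
        logger.removeHandler(handler)
        handler.close()
    return len(shared_items)
```

Calling `log_shared(shared_items)` after computing the intersection appends one timestamped line per address to the log file.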

I'm not sure how in-depth of an answer you were looking for, but this should get you started.

Update: CSV/Set usage example

Assuming you have a file "foo.csv", that looks something like this:

bob,123,127.0.0.1,mac-address-1
fred,124,127.0.0.1,mac-address-2

The simplest way to build the set would be something like this:

import csv

set1 = set()
for record in csv.reader(open('foo.csv', 'rb')):
    user, machine_id, ip_address, mac_address = record
    set1.add(mac_address)
    # or simply "set1.add(record[3])", if you don't need the other fields.

Obviously, you'd need something like this for each file, so you may want to put this in a function to make life easier.
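One way such a function could look, as a minimal sketch in Python 3 (the snippets in this answer are Python 2). The name `macs_from_csv` and its `column` and `has_header` parameters are invented for illustration, and the strip/lowercase step is an assumption about how the addresses should be compared, not something the original files are known to need.

```python
import csv


def macs_from_csv(path, column, has_header=False):
    """Return the set of values found in one column of a CSV file.

    `macs_from_csv`, `column` (zero-based) and `has_header` are
    hypothetical names chosen for this sketch.
    """
    macs = set()
    # newline="" is what the csv module documentation recommends for files
    # handed to csv.reader in Python 3.
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        if has_header:
            next(reader, None)  # skip the header row, if there is one
        for record in reader:
            if len(record) > column:  # ignore short or empty rows
                # Assumption: surrounding spaces and letter case should not
                # make two otherwise identical addresses compare unequal.
                macs.add(record[column].strip().lower())
    return macs
```

Called once per file, for example `macs_from_csv('foo.csv', 3)`, it yields two sets that can be intersected exactly as in the earlier example.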

Finally, if you want to go the less-verbose-but-cooler-python-way, you could also build the set like this:

csvfile = csv.reader(open('foo.csv', 'rb'))
set1 = set(rec[3] for rec in csvfile)   # Assuming mac-address is the 4th column.

﹏半生如梦愿梦如真 2024-12-27 01:33:24

I strongly recommend python to do this.

Because you didn't give the structure of the csv files, I can only show a framework:

def get_MAC_from_file1():
    a_MAC_list = []
    # ... parse the file to get the MACs
    return a_MAC_list
def get_MAC_from_file2():
    a_MAC_list = []
    # ... parse the file to get the MACs
    return a_MAC_list
def log_MACs():
    MAC_list1, MAC_list2 = get_MAC_from_file1(), get_MAC_from_file2()
    for a_MAC in MAC_list1:
        if a_MAC in MAC_list2:
            pass  # ... write your logs

If the data set is large, use a dict or set instead of the lists, together with the intersection operation. But since these are MAC addresses, I'd guess your data set is not that large, so keeping the script easy to read is the most important thing.
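As a rough sketch of the set-based variant mentioned here, assuming the two parsing functions return plain lists of strings: the helper name `shared_macs` is hypothetical, and keeping the order of the first list is a choice made for this example, not something the answer requires.

```python
def shared_macs(mac_list1, mac_list2):
    """Return the entries of mac_list1 that also appear in mac_list2.

    Order follows mac_list1 and each address is reported once.
    `shared_macs` is a made-up name for this sketch.
    """
    lookup = set(mac_list2)  # set membership tests are O(1) on average
    seen = set()
    shared = []
    for mac in mac_list1:
        if mac in lookup and mac not in seen:
            seen.add(mac)
            shared.append(mac)
    return shared
```

With lists, the `in` test scans the second list for every address in the first; converting the second list to a set once avoids that repeated scan.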

过度放纵 2024-12-27 01:33:24

Awk is perfect for this

{
   mac = $1  # assuming the mac addresses are in the first column
   mac_in_other_file = ""  # reset, because getline leaves the variable unchanged when grep prints nothing
   do_grep = "grep " mac " otherfilename" # we'll use grep to check if the mac address is in the other file
   do_grep | getline mac_in_other_file  # pipe the output of the grep command into a new variable
   close(do_grep)  # close the pipe
   if(mac_in_other_file != ""){     # if grep found the mac address in the other file
     print mac > "naughty_macs.log"  # write the mac address to the log file (awk keeps it open, so later matches are added)
   }
}

Then you'd run that on the first file:
awk -f logging_script.awk mac_list.txt

(this code is untested and I'm not the greatest awk hacker, but it should give the general idea)

丘比特射中我 2024-12-27 01:33:24

For example purposes, generate two files that look like yours.

File1:

for i in `seq 100`; do 
   echo -e "user$i\tmachine$i\t192.168.0.$i\tmac$i"; 
done > file1.csv

File2 (contains random entries of "mac addresses" numbered from 0 to 199)

for j in `seq 100`; do 
    i=$(($RANDOM % 200)) ; 
    echo -e "mac$i\tmachine$i\tuser$i"; 
done > file2.csv

The simplest approach would be to use the join command and do a join on the appropriate field. This approach has the advantage that fields from both files are available in the output.

Based on the example files above, the command would look like this:

join -1 4 -2 1 <(sort -k4 file1.csv)  <(sort -k1 file2.csv)

join needs its input to be sorted on the field you are matching, which is why the sort is there (-k tells sort which column to use).
The command above matches rows from file1.csv with rows from file2.csv when column 4 of the first file is equal to column 1 of the second file.
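For comparison, the same kind of join can be sketched in Python 3 with a dictionary keyed on the mac field. This is an illustration rather than a replacement for the join command: the function name `join_on_mac` and its parameters are made up, the tab delimiter is assumed from the generated sample files, and the column indexes are zero-based (so `-1 4 -2 1` corresponds to `key1=3, key2=0`).

```python
import csv


def join_on_mac(path1, key1, path2, key2, delimiter="\t"):
    """Yield (mac, row_from_file1, row_from_file2) for every matching pair.

    `join_on_mac` and its parameter names are hypothetical.
    """
    # Index the first file by its key column.
    rows_by_mac = {}
    with open(path1, newline="") as handle:
        for row in csv.reader(handle, delimiter=delimiter):
            if len(row) > key1:
                rows_by_mac.setdefault(row[key1], []).append(row)
    # Stream the second file and look each key up in the index.
    with open(path2, newline="") as handle:
        for row in csv.reader(handle, delimiter=delimiter):
            if len(row) > key2:
                for match in rows_by_mac.get(row[key2], []):
                    yield row[key2], match, row
```

Unlike join, this needs no sorted input, and the output follows the order of the second file; the cost is that the first file is held in memory.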

If you only need specific fields, you can specify the output format to the join command:

join -1 4 -2 1 -o 1.4,1.2 <(sort -k4 file1.csv)  <(sort -k1 file2.csv)

This would print only the mac address and the machine field from the first file.

If you only need a list of matching mac addresses, you can use uniq or sort -u. Since the join output will be sorted by mac, uniq is faster. But if you need a unique list of another field, sort -u is better.

If you only need the mac addresses that match, grep can accept patterns from a file, and you can use cut to extract only the fourth field.

fgrep -f<(cut -f4 file1.csv) file2.csv

The above lists all the lines in file2.csv that contain a mac address from file1.csv.
Note that I'm using fgrep, which matches fixed strings rather than patterns. If file1 is big, this may be slower than the first approach. It also assumes that the mac is present only in field 1 of file2 and that the other fields don't contain mac addresses.
If you only need the mac, you can either use the -o option of fgrep (though some grep variants don't have it), or pipe the output through cut and then sort -u:

fgrep -f<(cut -f4 file1.csv) file2.csv | cut -f1 | sort -u

This would be the bash way.

Python and awk hints have been shown above, I will take a stab at perl:

#!/usr/bin/perl -w

use strict;

open F1, $ARGV[0];
my %searched_mac_addresses = map {chomp; (split /\t/)[3] => 1 } <F1>;
close F1;

open F2, $ARGV[1];

while (<F2>) {
    print if $searched_mac_addresses{(split "\t")[0]}
}

close F2

First you create a dictionary containing all the mac addresses from the first file:

my %searched_mac_addresses = map {chomp; (split /\t/)[3] => 1 } <F1>;
  • <F1> reads all the lines from file1
  • chomp removes the end of line
  • split splits the line on tabs; you can use a more complex regexp if needed
  • the () around split force list context
  • [3] selects the fourth field
  • map runs a piece of code for all elements of the list
  • => generates a dictionary (hash in perl's terminology) element instead of an array

Then you read the second file line by line, and check whether the mac exists in the above dictionary:

while (<F2>) {
    print if $searched_mac_addresses{(split "\t")[0]}
}
  • while (<F2>) reads the file F2 and puts each line in the $_ variable
  • print without any parameters prints the default variable $_
  • if can be used as a postfix on a statement
  • dictionary elements can be accessed via {}
  • split without a string argument splits the default variable $_