跨多个 csv 文件匹配行并合并特定字段

发布于 2024-09-12 04:59:56 字数 1359 浏览 2 评论 0原文

我有大约 20 个 CSV，它们看起来都是这样的：

"[email]","[fname]","[lname]","[prefix]","[suffix]","[fax]","[phone]","[business]","[address1]","[address2]","[city]","[state]","[zip]","[setdate]","[email_type]","[start_code]"

我被告知需要生成完全相同的东西，但每个文件现在都包含电子邮件匹配的每个其他文件的 start_code。

如果任何其他字段不匹配也没关系，只是电子邮件字段很重要，对每个文件的唯一更改是添加电子邮件匹配的其他文件中的任何其他 start_code 值。

例如，如果同一电子邮件出现在 wicq.csv、oota.csv 和 itos.csv 中，则每个文件中的内容将从：

"[email protected]","anon",,,,,,,,,,,,01/16/08 08:05 PM,,"WIQC PDX"
"[email protected]","anon",,,,,,,,,,,,01/16/08 08:05 PM,,"OOTA"
"[email protected]","anon",,,,,,,,,,,,01/16/08 08:05 PM,,"ITOS"

变为

"[email protected]","anon",,,,,,,,,,,,01/16/08 08:05 PM,,"WIQC PDX, OOTA, ITOS"

所有三个文件（wicq.csv、oota.csv 和 itos）。 csv）

我可用的工具是 OS X 命令行（awk、sed 等）以及 perl，尽管我对两者都不太熟悉，但可能有更好的方法来做到这一点。

原文

I have about 20 CSV's that all look like this:

"[email]","[fname]","[lname]","[prefix]","[suffix]","[fax]","[phone]","[business]","[address1]","[address2]","[city]","[state]","[zip]","[setdate]","[email_type]","[start_code]"

What I've been told I need to produce is the exact same thing, but with each file now containing the start_code from every other file where the email matches.

It doesn't matter if any of the other fields don't match, just the email field is important, and the only change to each file would be to add any other start_code values from other files where the email matches.

For example, if the same email appeared in the wicq.csv, oota.csv, and itos.csv it would go from being the following in each file:

"[email protected]","anon",,,,,,,,,,,,01/16/08 08:05 PM,,"WIQC PDX"
"[email protected]","anon",,,,,,,,,,,,01/16/08 08:05 PM,,"OOTA"
"[email protected]","anon",,,,,,,,,,,,01/16/08 08:05 PM,,"ITOS"

"[email protected]","anon",,,,,,,,,,,,01/16/08 08:05 PM,,"WIQC PDX, OOTA, ITOS"

for all three files (wicq.csv, oota.csv, and itos.csv)

Tools I have available would be OS X command line (awk, sed, etc) as well as perl-though I'm not too familiar with either, and there may be a better way to do this.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

请你别敷衍 2024-09-19 04:59:56

use strict;
use warnings;
use Text::CSV_XS;

# Supply csv files as command line arguments.
my @csv_files = @ARGV;
my $parser    = Text::CSV_XS->new;

# In my test data, the email is the first field. The field
# to be merged is the second. Adjust accordingly.
my $EMAIL_i   = 0;
my $MERGE_i   = 1;

# Process all files, creating a set of key-value pairs:
#    $sc{EMAIL} = [ LIST OF VALUES OBSERVED IN THE MERGE FIELD ]
my %sc;
for my $cf (@csv_files){
    open(my $fh_in, '<', $cf) or die $!;

    while (my $line = <$fh_in>){
        die "Failed parse : $cf : $.\n" unless $parser->parse($line);
        my @fields = $parser->fields;
        push @{ $sc{$fields[$EMAIL_i]} }, $fields[$MERGE_i];
    }
}

# Process the files again, writing new output.
for my $cf (@csv_files){
    open(my $fh_in,  '<', $cf)             or die $!;
    open(my $fh_out, '>', "${cf}_new.csv") or die $!;

    while (my $line = <$fh_in>){
        die "Failed parse : $cf : $.\n" unless $parser->parse($line);
        my @fields = $parser->fields;

        $fields[$MERGE_i] = join ', ', @{ $sc{$fields[$EMAIL_i]} };

        $parser->print($fh_out, \@fields);
        print $fh_out "\n";
    }
}

use strict;
use warnings;
use Text::CSV_XS;

# Supply csv files as command line arguments.
my @csv_files = @ARGV;
my $parser    = Text::CSV_XS->new;

# In my test data, the email is the first field. The field
# to be merged is the second. Adjust accordingly.
my $EMAIL_i   = 0;
my $MERGE_i   = 1;

# Process all files, creating a set of key-value pairs:
#    $sc{EMAIL} = [ LIST OF VALUES OBSERVED IN THE MERGE FIELD ]
my %sc;
for my $cf (@csv_files){
    open(my $fh_in, '<', $cf) or die $!;

    while (my $line = <$fh_in>){
        die "Failed parse : $cf : $.\n" unless $parser->parse($line);
        my @fields = $parser->fields;
        push @{ $sc{$fields[$EMAIL_i]} }, $fields[$MERGE_i];
    }
}

# Process the files again, writing new output.
for my $cf (@csv_files){
    open(my $fh_in,  '<', $cf)             or die $!;
    open(my $fh_out, '>', "${cf}_new.csv") or die $!;

    while (my $line = <$fh_in>){
        die "Failed parse : $cf : $.\n" unless $parser->parse($line);
        my @fields = $parser->fields;

        $fields[$MERGE_i] = join ', ', @{ $sc{$fields[$EMAIL_i]} };

        $parser->print($fh_out, \@fields);
        print $fh_out "\n";
    }
}

回复收藏 0 原文

晨与橙与城 2024-09-19 04:59:56

我会通过执行以下操作来解决此问题：

cut -d ',' -f1,16 *.csv | 
    sort |
    awk -F, '{d=""; if (array[$1]) d=","; array[$1] = array[$1] d $2} END { for (i in array) print i "," array[i]}' |
    while IFS="," read -r email start; do sed -i "/^$email,/ s/,[^,]*\$/,$start/" *.csv; done

这将创建所有电子邮件的列表（cut/sort）和 start_codes 并合并（awk代码>）他们。然后它会替换 (sed) 每个文件中每个匹配电子邮件的 start_code (while)。

但我觉得必须有一种更有效的方法。

I would approach this by doing something along the lines of:

cut -d ',' -f1,16 *.csv | 
    sort |
    awk -F, '{d=""; if (array[$1]) d=","; array[$1] = array[$1] d $2} END { for (i in array) print i "," array[i]}' |
    while IFS="," read -r email start; do sed -i "/^$email,/ s/,[^,]*\$/,$start/" *.csv; done

This creates a list of all the emails (cut/sort) and start_codes and consolidates (awk) them. Then it replaces (sed) the start_code for each matching email in each file (while).

But I feel like there must be a more efficient way.

回复收藏 0 原文

悲歌长辞 2024-09-19 04:59:56

这是一个简单的 Perl 程序，可以实现您的需求。它依靠预先排序的事实对您的输入进行单次传递。

只要电子邮件不更改，它就会读取行并附加代码。当电子邮件更改时，它会打印记录（并修复代码字段中多余的双引号）。

#!/usr/bin/perl -l

use strict;
use warnings;

my $last_email = undef;
my @current_record = ();
my @fields = ();

sub print_record {
   # Remove repeated double quotes introduced when we appended the code
  $current_record[15] =~ s/""/, /g;
  print join ",", @current_record;
  @current_record = ();
} 

while (my $input_line = <>) {
  chomp $input_line;
  @fields = split ",", $input_line;

  # Print a record when the email we read changes. Avoid printing on the first
  # loop by checking we have read at least one email ($last_email is defined).
  defined $last_email && ($fields[0] ne $last_email) && print_record;

  if (!@current_record)  {
    # We are starting to process a new email. Grab all fields.
    @current_record = @fields;
  }
  else {
    # We have consecutive records with the same email. Append the code.
    $current_record[15] .= $fields[15];
  }

  # Remember the last processed email. When it changes we will print @current_record.
  $last_email = $fields[0];
}

# Print the last record
print_record

-l 开关会自动打印添加新行字符（无论操作系统是什么）。

像这样称呼它：

sort *.csv | ./script.pl

Here's a simple Perl program achieving what you need. It does a single pass on your input by relying on the fact that it is sorted beforehand.

It reads lines and appends the code at long as the email does not change. When the email changes, it prints the record (and fixes extra double quotes in the code field).

#!/usr/bin/perl -l

use strict;
use warnings;

my $last_email = undef;
my @current_record = ();
my @fields = ();

sub print_record {
   # Remove repeated double quotes introduced when we appended the code
  $current_record[15] =~ s/""/, /g;
  print join ",", @current_record;
  @current_record = ();
} 

while (my $input_line = <>) {
  chomp $input_line;
  @fields = split ",", $input_line;

  # Print a record when the email we read changes. Avoid printing on the first
  # loop by checking we have read at least one email ($last_email is defined).
  defined $last_email && ($fields[0] ne $last_email) && print_record;

  if (!@current_record)  {
    # We are starting to process a new email. Grab all fields.
    @current_record = @fields;
  }
  else {
    # We have consecutive records with the same email. Append the code.
    $current_record[15] .= $fields[15];
  }

  # Remember the last processed email. When it changes we will print @current_record.
  $last_email = $fields[0];
}

# Print the last record
print_record

The -l switch has print automatically add a new line char (whatever the os is).

Call it like this: