Matching rows across multiple CSV files and merging a specific field
I have about 20 CSV files that all look like this:
"[email]","[fname]","[lname]","[prefix]","[suffix]","[fax]","[phone]","[business]","[address1]","[address2]","[city]","[state]","[zip]","[setdate]","[email_type]","[start_code]"
What I've been told I need to produce is the exact same thing, but with each file now containing the start_code from every other file where the email matches.
It doesn't matter if any of the other fields don't match, just the email field is important, and the only change to each file would be to add any other start_code values from other files where the email matches.
For example, if the same email appeared in wicq.csv, oota.csv, and itos.csv, each file would go from containing the following:
"[email protected]","anon",,,,,,,,,,,,01/16/08 08:05 PM,,"WIQC PDX"
"[email protected]","anon",,,,,,,,,,,,01/16/08 08:05 PM,,"OOTA"
"[email protected]","anon",,,,,,,,,,,,01/16/08 08:05 PM,,"ITOS"
to
"[email protected]","anon",,,,,,,,,,,,01/16/08 08:05 PM,,"WIQC PDX, OOTA, ITOS"
for all three files (wicq.csv, oota.csv, and itos.csv).
The tools I have available are the OS X command line (awk, sed, etc.) as well as Perl, though I'm not too familiar with either, and there may be a better way to do this.
Comments (3)
I would approach this by doing something along the lines of:
This creates a list of all the emails (cut/sort) and start_codes and consolidates (awk) them. Then it replaces (sed) the start_code for each matching email in each file (while). But I feel like there must be a more efficient way.
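One way to avoid running a sed replacement per email is to read every file once to build an email-to-codes table, then read each file a second time and rewrite only the last field. Below is a minimal sketch of that idea in Python rather than shell. It assumes the email is the first field and start_code the last, that emails must match exactly, and that it is acceptable for the output to quote every field. The `.merged` suffix and the function names are made up for illustration.

```python
import csv
import sys
from collections import defaultdict

EMAIL_COL = 0   # assumption: email is the first field
CODE_COL = -1   # assumption: start_code is the last field


def collect_codes(paths):
    """Pass 1: map each email to its distinct start_codes, in first-seen order."""
    codes = defaultdict(list)
    for path in paths:
        with open(path, newline="") as fh:
            for row in csv.reader(fh):
                if not row:
                    continue  # skip blank lines
                email, code = row[EMAIL_COL], row[CODE_COL]
                if code and code not in codes[email]:
                    codes[email].append(code)
    return codes


def rewrite(paths, codes, suffix=".merged"):
    """Pass 2: write a copy of each file with the combined start_code."""
    for path in paths:
        with open(path, newline="") as src, \
                open(path + suffix, "w", newline="") as dst:
            # QUOTE_ALL quotes every field, which differs slightly from
            # the mixed quoting in the original files.
            writer = csv.writer(dst, quoting=csv.QUOTE_ALL, lineterminator="\n")
            for row in csv.reader(src):
                if row:
                    row[CODE_COL] = ", ".join(codes[row[EMAIL_COL]])
                writer.writerow(row)


if __name__ == "__main__":
    files = sys.argv[1:]
    rewrite(files, collect_codes(files))
```

Saved under a hypothetical name such as merge_codes.py, it could be run as `python3 merge_codes.py *.csv`. The originals are left untouched, and a header row, if present, keeps its own value because identical codes are only recorded once.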
Here's a simple Perl program achieving what you need. It does a single pass on your input by relying on the fact that it is sorted beforehand.
It reads lines and appends the code as long as the email does not change. When the email changes, it prints the record (and fixes extra double quotes in the code field).
The -l switch makes print automatically append a newline character (whatever the OS is).
Call it like this:
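Separately from the Perl program, the single-pass idea described in this answer (keep appending codes while the email stays the same, emit the record when it changes) can be sketched as follows. This is a Python illustration, not the answer's code. It assumes the input lines are already sorted so that rows with the same email are adjacent, that the email is the first field and start_code the last, and it uses shortened rows with made-up example.com addresses.

```python
import csv
import io
from itertools import groupby


def merge_sorted(lines):
    """Yield one row per email from CSV lines already sorted by email.

    Consecutive rows sharing an email are collapsed: the first row's other
    fields are kept and the distinct start_codes are joined with ", ".
    """
    rows = (r for r in csv.reader(lines) if r)  # skip blank lines
    for _email, group in groupby(rows, key=lambda r: r[0]):
        group = list(group)
        merged = group[0][:]
        # dict.fromkeys drops repeated codes while preserving order
        merged[-1] = ", ".join(dict.fromkeys(r[-1] for r in group if r[-1]))
        yield merged


# Hypothetical, shortened sample rows (sorted by email).
sample = io.StringIO(
    '"a@example.com","anon",,"WIQC PDX"\n'
    '"a@example.com","anon",,"OOTA"\n'
    '"a@example.com","anon",,"ITOS"\n'
    '"b@example.com","bob",,"OOTA"\n'
)
for row in merge_sorted(sample):
    print(row[0], "->", row[-1])
# a@example.com -> WIQC PDX, OOTA, ITOS
# b@example.com -> OOTA
```

The result is one merged record per email. Writing the combined code back into each of the original files is a further step that this sketch does not cover.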