Unix: joining two files on a regular expression using awk
I have one file (lookup.txt) which contains a lookup table consisting of a list of regular expressions, with corresponding data (categories and periods), e.g.:
INTERNODE|household/bills/broadband|monthly
ORIGIN ENERGY|household/bills/electricity|quarterly
TELSTRA.*BILL|household/bills/phone|quarterly
OPTUS|household/bills/mobile|quarterly
SKYPE|household/bills/skype|non-periodic
I have another file (data.txt) which contains a list of expenses, e.g.:
2009-10-31,cc,-39.9,INTERNODE BROADBAND
2009-10-31,cc,-50,ORIGIN ENERGY 543546
2009-10-31,cc,-68,INTERNODE BROADBAND EXCESS CHARGES
2009-10-31,cc,-90,TELSTRA MOBILE BILL
2009-11-02,cc,-320,TELSTRA HOME BILL
2009-11-03,cc,-22.96,DICK SMITH
2009-11-03,cc,-251.24,BUNNINGS
2009-11-04,cc,-4.2,7-ELEVEN
I want to join these two together, whereby the 4th column in the data.txt file matches the regular expression from the first column of the lookup.txt file.
So the output would be:
2009-10-31,cc,-39.9,INTERNODE BROADBAND,household/bills/broadband,monthly
2009-10-31,cc,-50,ORIGIN ENERGY 543546,household/bills/electricity,quarterly
2009-10-31,cc,-68,INTERNODE BROADBAND EXCESS CHARGES,household/bills/broadband,monthly
2009-10-31,cc,-90,TELSTRA MOBILE BILL,household/bills/phone,quarterly
2009-11-02,cc,-320,TELSTRA HOME BILL,household/bills/phone,quarterly
2009-11-03,cc,-22.96,DICK SMITH
2009-11-03,cc,-251.24,BUNNINGS
2009-11-04,cc,-4.2,7-ELEVEN
I've achieved this using a bash loop: looping over the lookup file, doing greps, and adding the extra columns on using sed, but it is very slow. So I was wondering if there is a faster method of doing this, say using awk.
Any help would be appreciated.
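For concreteness, the slow approach described above might have looked roughly like the sketch below. This is a hypothetical reconstruction (sed-only, with the question's sample files inlined; the file names joined.txt and tmp.txt are invented here), not the asker's actual script; it reproduces the expected output shown in the question, one full pass over the data per lookup regex.

```shell
# Recreate the question's sample inputs.
cat > lookup.txt <<'EOF'
INTERNODE|household/bills/broadband|monthly
ORIGIN ENERGY|household/bills/electricity|quarterly
TELSTRA.*BILL|household/bills/phone|quarterly
OPTUS|household/bills/mobile|quarterly
SKYPE|household/bills/skype|non-periodic
EOF
cat > data.txt <<'EOF'
2009-10-31,cc,-39.9,INTERNODE BROADBAND
2009-10-31,cc,-50,ORIGIN ENERGY 543546
2009-10-31,cc,-68,INTERNODE BROADBAND EXCESS CHARGES
2009-10-31,cc,-90,TELSTRA MOBILE BILL
2009-11-02,cc,-320,TELSTRA HOME BILL
2009-11-03,cc,-22.96,DICK SMITH
2009-11-03,cc,-251.24,BUNNINGS
2009-11-04,cc,-4.2,7-ELEVEN
EOF

# For each lookup entry, append ",category,period" to every data line
# whose fourth field matches the regex. One sed pass per regex is what
# makes this approach slow on large inputs.
cp data.txt joined.txt
while IFS='|' read -r re category period; do
  sed -E "s#^([^,]*,[^,]*,[^,]*,.*${re}.*)\$#\1,${category},${period}#" joined.txt > tmp.txt
  mv tmp.txt joined.txt
done < lookup.txt
cat joined.txt
```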
You can do it in Python. For your files lookup.txt and data.txt, it returns the expected output in less than 0.3s.
If you didn't have the regexes, you could use join. How many regexps does lookup.txt have? If it's just that one, just expand it and drop that feature.
Awk is really designed to process a single stream of data one record at a time, so it isn't the right tool for this job. It would be a ten-minute exercise in Perl or another language that's more oriented toward general-purpose programming.
If you're bent on doing it all in awk, write one script to generate a second awk script from your lookup file that processes the data, then run the second script.
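The generate-a-second-script idea suggested above can be sketched as follows (a minimal sketch, with the question's sample files inlined; the names match.awk and joined.txt are invented here, and the lookup regexes are assumed not to contain "/" or "|"). The first awk turns each lookup line into a match-and-append rule; the second runs the generated rules over the data:

```shell
# Recreate the question's sample inputs.
cat > lookup.txt <<'EOF'
INTERNODE|household/bills/broadband|monthly
ORIGIN ENERGY|household/bills/electricity|quarterly
TELSTRA.*BILL|household/bills/phone|quarterly
OPTUS|household/bills/mobile|quarterly
SKYPE|household/bills/skype|non-periodic
EOF
cat > data.txt <<'EOF'
2009-10-31,cc,-39.9,INTERNODE BROADBAND
2009-10-31,cc,-50,ORIGIN ENERGY 543546
2009-10-31,cc,-68,INTERNODE BROADBAND EXCESS CHARGES
2009-10-31,cc,-90,TELSTRA MOBILE BILL
2009-11-02,cc,-320,TELSTRA HOME BILL
2009-11-03,cc,-22.96,DICK SMITH
2009-11-03,cc,-251.24,BUNNINGS
2009-11-04,cc,-4.2,7-ELEVEN
EOF

# Step 1: generate match.awk from the lookup file. Each lookup line becomes
# a rule of the form:
#   $4 ~ /REGEX/ { print $0 ",category,period"; next }
# plus a final catch-all that passes unmatched lines through unchanged.
awk -F'|' '{ printf "$4 ~ /%s/ { print $0 \",%s,%s\"; next }\n", $1, $2, $3 }
           END { print "{ print }" }' lookup.txt > match.awk

# Step 2: run the generated script over the data in a single pass.
awk -F',' -f match.awk data.txt > joined.txt
cat joined.txt
```

Because the regexes are compiled into the generated program, the data file is read only once, which is where the speedup over the per-regex grep/sed loop comes from.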
You can do it in Perl. The advantage of Perl (or Python) is that they have libraries for dealing with CSV files. Your examples are simple enough, but what happens if you have a comma inside double quotes? Or what about UTF-8? Etc.
The standard Perl library for this is Text::CSV_XS. However, it's a bit verbose, and I prefer Parse::CSV, which is a wrapper around Text::CSV_XS.