Join two files in Unix using regular expressions (preferably with Perl)
I have the following two files, disconnect.txt and answered.txt:
disconnect.txt
2011-07-08 00:59:06,363 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:459 - AnalyzedInfo had ActCode = Disconnected from: 40397400012 to:40397400032
2011-07-08 00:59:06,363 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:459 - AnalyzedInfo had ActCode = Disconnected from: 4035350012 to:40677400032
answered.txt
2011-07-08 00:59:40,706 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:404 - Normal Call Answered, billingid=2301986 from: 40397643433 to:403###34**
2011-07-08 00:59:40,706 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:404 - Normal Call Answered, billingid=2301986 from: 3455334459 to:1222
2011-07-08 00:59:48,893 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:404 - Normal Call Answered, billingid=2220158 from: 4035350012 to:40677400032
I would like to create a join on these files based on the from: and to: fields, and the output should be the matching lines from answered.txt. For example, with the two files above, the output would be:
2011-07-08 00:59:48,893 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:404 - Normal Call Answered, billingid=2220158 from: 4035350012 to:40677400032
I'm currently doing it by comparing each line in file 1 with each line in file 2, but want to know if a more efficient way exists (these files will be tens of gigabytes each).
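To make the join key concrete, here is a minimal sketch of the kind of extraction meant here (the regex is an assumption based on the sample lines above):

    #!/usr/bin/perl
    # Sketch: pull out the from/to pair that each line should be joined on.
    # The regex is inferred from the sample log lines; adjust it if the
    # real format differs.
    use strict;
    use warnings;

    while (my $line = <>) {
        if ($line =~ /from:\s*(\S+)\s+to:\s*(\S+)/) {
            print "key = $1,$2\n";   # e.g. key = 4035350012,40677400032
        }
    }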
Thank you
3 Answers
Sounds like you have hundreds of millions of lines?
Unless the files are sorted in such a way that you can expect the order of the from: and to: to at least vaguely correlate, this is a job for a database.
If the files are large, the quadratic algorithm will take a lifetime.
Here is a Ruby script that uses just a single hash table lookup per line in answered.txt:
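The script itself isn't reproduced here, so below is a minimal sketch of the same single-lookup idea, written in Perl (which the question prefers) rather than Ruby; the file names are taken from the question and the key regex is an assumption based on the sample lines:

    #!/usr/bin/perl
    # Pass 1: record every from/to pair from disconnect.txt in a hash.
    # Pass 2: print each answered.txt line whose pair is in the hash --
    # one O(1) hash lookup per line instead of a quadratic scan.
    use strict;
    use warnings;

    my %seen;

    open my $disc, '<', 'disconnect.txt' or die "disconnect.txt: $!";
    while (<$disc>) {
        $seen{"$1,$2"} = 1 if /from:\s*(\S+)\s+to:\s*(\S+)/;
    }
    close $disc;

    open my $ans, '<', 'answered.txt' or die "answered.txt: $!";
    while (<$ans>) {
        print if /from:\s*(\S+)\s+to:\s*(\S+)/ && $seen{"$1,$2"};
    }
    close $ans;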
Like ysth says, it all depends on the number of lines in disconnect.txt. If that's a really big¹ number, then you will probably not be able to fit all the keys in memory and you will need a database.
¹ The number of lines in disconnect.txt multiplied by (roughly) 64 should be less than the amount of memory in your machine.
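For a sense of scale: by that rule of thumb, 100 million lines in disconnect.txt would need roughly 100,000,000 × 64 ≈ 6.4 GB of memory for the hash.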
First, sort the files on the from/to timestamps if they are not already sorted that way. (Yes, I know the from/to appear to be stored as epoch seconds, but that's still a timestamp.)
Then take the sorted files and compare the first lines of each.
This is the fastest way to compare two (or more) sorted files and it guarantees that no line will be read from disk more than once.
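A sketch of that merge step, assuming both files have already been sorted on the extracted from/to key (the .sorted file names are placeholders, and the key regex is inferred from the question's sample lines):

    #!/usr/bin/perl
    # Classic sorted-merge join: read both files in lock-step, always
    # advancing whichever side has the smaller key, so every line is
    # read from disk exactly once.
    use strict;
    use warnings;

    sub key_of {
        my ($line) = @_;
        return $line =~ /from:\s*(\S+)\s+to:\s*(\S+)/ ? "$1,$2" : '';
    }

    open my $dfh, '<', 'disconnect.sorted' or die "disconnect.sorted: $!";
    open my $afh, '<', 'answered.sorted'   or die "answered.sorted: $!";

    my $dline = <$dfh>;
    my $aline = <$afh>;
    while (defined $dline && defined $aline) {
        my $cmp = key_of($dline) cmp key_of($aline);
        if    ($cmp < 0) { $dline = <$dfh>; }    # disconnect side is behind
        elsif ($cmp > 0) { $aline = <$afh>; }    # answered side is behind
        else  { print $aline; $aline = <$afh>; } # match: emit the answered line
    }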
If your files aren't appropriately sorted, then the initial sorting operation may be somewhat expensive on files in the "tens of gigabytes each" size range, but it is a one-time cost and the sorted files can be reused on later runs.
Or you could just use a database as mentioned in previous answers. The above method will be more efficient in most, if not all, cases, but a database-based solution would be easier to write and would also provide a lot of flexibility for analyzing your data in other ways without needing to do a complete scan through each file every time you need to access anything in it.