Join two files using regular expressions in Unix (preferably with perl)

I have the following two files, disconnect.txt and answered.txt:

disconnect.txt

2011-07-08 00:59:06,363 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:459 - AnalyzedInfo had ActCode = Disconnected from: 40397400012 to:40397400032
2011-07-08 00:59:06,363 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:459 - AnalyzedInfo had ActCode = Disconnected from: 4035350012 to:40677400032

answered.txt

2011-07-08 00:59:40,706 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:404 - Normal Call Answered, billingid=2301986 from: 40397643433 to:403###34**
2011-07-08 00:59:40,706 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:404 - Normal Call Answered, billingid=2301986 from: 3455334459 to:1222
2011-07-08 00:59:48,893 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:404 - Normal Call Answered, billingid=2220158 from: 4035350012 to:40677400032

I would like to create a join on these files based on the from: and to: fields, and the output should be the matching lines from answered.txt. For example, for the two files above, the output would be:

2011-07-08 00:59:48,893 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:404 - Normal Call Answered, billingid=2220158 from: 4035350012 to:40677400032

I'm currently doing it by comparing each line in file 1 with each line in file 2, but would like to know whether a more efficient way exists (these files will run to tens of gigabytes).

Thank you

3 Answers

烂人 2024-11-25 19:16:10

Sounds like you have hundreds of millions of lines?

Unless the files are sorted in such a way that you can expect the order of the from: and to: to at least vaguely correlate, this is a job for a database.
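For illustration, here is a minimal sketch of that database route (the calls.db file, table and column names are all hypothetical, and it assumes Perl's DBI with DBD::SQLite installed; the from:/to: regex is inferred from the sample lines):

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Hypothetical schema: one table per log file, keyed on the from/to pair.
my $dbh = DBI->connect("dbi:SQLite:dbname=calls.db", "", "",
                       { RaiseError => 1, AutoCommit => 0 });
$dbh->do("CREATE TABLE disconnect (fromno TEXT, tono TEXT)");
$dbh->do("CREATE TABLE answered   (fromno TEXT, tono TEXT, line TEXT)");
$dbh->do("CREATE INDEX disc_key ON disconnect (fromno, tono)");

# Load one file, extracting the from:/to: values from each line.
sub load {
    my ($file, $sth, $keep_line) = @_;
    open my $fh, '<', $file or die "$file: $!";
    while (my $line = <$fh>) {
        next unless $line =~ /from:\s*(\S+)\s+to:\s*(\S+)/;
        $sth->execute($1, $2, $keep_line ? $line : ());
    }
    close $fh;
}

load('disconnect.txt', $dbh->prepare("INSERT INTO disconnect VALUES (?, ?)"), 0);
load('answered.txt',   $dbh->prepare("INSERT INTO answered VALUES (?, ?, ?)"), 1);
$dbh->commit;

# The join: answered lines whose from/to pair also appears in disconnect.
my $sth = $dbh->prepare(q{
    SELECT line FROM answered a
    WHERE EXISTS (SELECT 1 FROM disconnect d
                  WHERE d.fromno = a.fromno AND d.tono = a.tono)
});
$sth->execute;
while (my ($line) = $sth->fetchrow_array) { print $line }

The index on (fromno, tono) is what keeps the EXISTS lookup from degenerating into the same quadratic scan the asker is trying to escape.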

拥有 2024-11-25 19:16:10

If the files are large, the quadratic algorithm will take a lifetime.

Here is a Ruby script that uses just a single hash table lookup per line in answered.txt:

# Build the join key "from.to" for a line, e.g. "4035350012.40677400032".
def key s
  s.split('from:')[1].split('to:').map(&:strip).join('.')
end

# Pass 1: remember every from/to pair seen in disconnect.txt.
h = {}
open 'disconnect.txt', 'r' do |f|
  while s = f.gets
    h[key(s)] = true
  end
end

# Pass 2: print the answered.txt lines whose pair was also disconnected.
open 'answered.txt', 'r' do |f|
  while a = f.gets
    puts a if h[key(a)]
  end
end

Like ysth says, it all depends on the number of lines in disconnect.txt. If that's a really big¹ number, then you will probably not be able to fit all the keys in memory and you will need a database.


¹ The number of lines in disconnect.txt multiplied by (roughly) 64 should be less than the amount of memory in your machine.
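Since the question asks for Perl in particular, here is a minimal Perl sketch of the same single-pass hash join (the regex for the from:/to: fields is an assumption based on the sample lines; lines without a pair are skipped):

#!/usr/bin/perl
use strict;
use warnings;

# Build a "from.to" join key, or undef if the line has no from:/to: pair.
sub key {
    my ($line) = @_;
    return $line =~ /from:\s*(\S+)\s+to:\s*(\S+)/ ? "$1.$2" : undef;
}

# Pass 1: remember every from/to pair seen in disconnect.txt.
my %seen;
open my $dfh, '<', 'disconnect.txt' or die "disconnect.txt: $!";
while (my $line = <$dfh>) {
    my $k = key($line);
    $seen{$k} = 1 if defined $k;
}
close $dfh;

# Pass 2: print the answered.txt lines whose pair was also disconnected.
open my $afh, '<', 'answered.txt' or die "answered.txt: $!";
while (my $line = <$afh>) {
    my $k = key($line);
    print $line if defined $k and $seen{$k};
}
close $afh;

The same memory caveat applies: %seen holds one entry per distinct from/to pair in disconnect.txt.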

迟月 2024-11-25 19:16:10

First, sort the files on the from/to timestamps if they are not already sorted that way. (Yes, I know the from/to appear to be stored as epoch seconds, but that's still a timestamp.)

Then take the sorted files and compare the first lines of each.

  • If the timestamps are the same, you have a match. Hooray! Advance a line in one or both files (depending on your rules for duplicate timestamps in each) and compare again.
  • If not, grab the next line in whichever file has the earlier timestamp and compare again.

This is the fastest way to compare two (or more) sorted files and it guarantees that no line will be read from disk more than once.
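A minimal Perl sketch of that merge, assuming both inputs have already been sorted by their from/to key (the disconnect.sorted and answered.sorted names are placeholders) and that every line carries a from:/to: pair:

#!/usr/bin/perl
use strict;
use warnings;

# Join key for a log line; this sketch assumes every line has one.
sub key {
    my ($line) = @_;
    $line =~ /from:\s*(\S+)\s+to:\s*(\S+)/ or die "no key in: $line";
    return "$1.$2";
}

open my $dfh, '<', 'disconnect.sorted' or die "disconnect.sorted: $!";
open my $afh, '<', 'answered.sorted'   or die "answered.sorted: $!";

my $dline = <$dfh>;
my $aline = <$afh>;
while (defined $dline && defined $aline) {
    my $cmp = key($dline) cmp key($aline);
    if    ($cmp < 0) { $dline = <$dfh> }                # disconnect side is behind: advance it
    elsif ($cmp > 0) { $aline = <$afh> }                # answered side is behind: advance it
    else             { print $aline; $aline = <$afh> }  # match: emit the answered line
}

Because each file handle only ever moves forward, memory use stays constant no matter how large the files are.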

If your files aren't appropriately sorted, then the initial sorting operation may be somewhat expensive on files in the "tens of gigabytes each" size range (one way to do that sort is sketched after this list), but:

  1. You can split the files into arbitrarily sized chunks (ideally small enough for each chunk to fit into memory), sort each chunk independently, and then generalize the above algorithm from two files to as many as are necessary.
  2. Even if you don't do that, and you deal with the disk thrashing involved in sorting files larger than the available memory, sorting and then doing a single pass over each file will still be a lot faster than any solution involving a Cartesian join.
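As for producing the sorted inputs used in the merge sketch above: GNU sort already performs an external merge sort (spilling to temporary files), so it copes with files larger than memory on its own. One hedged way to sort each file by its from/to key is a decorate-sort-undecorate pipeline run from the shell (file names are illustrative; LC_ALL=C keeps sort's byte-wise ordering consistent with Perl's cmp):

perl -ne 'print "$1.$2\t$_" if /from:\s*(\S+)\s+to:\s*(\S+)/' disconnect.txt \
    | LC_ALL=C sort | cut -f2- > disconnect.sorted
perl -ne 'print "$1.$2\t$_" if /from:\s*(\S+)\s+to:\s*(\S+)/' answered.txt \
    | LC_ALL=C sort | cut -f2- > answered.sorted

Each line is prefixed with its key and a tab, the whole file is sorted on that prefix, and cut then strips the prefix back off.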

Or you could just use a database as mentioned in previous answers. The above method will be more efficient in most, if not all, cases, but a database-based solution would be easier to write and would also provide a lot of flexibility for analyzing your data in other ways without needing to do a complete scan through each file every time you need to access anything in it.
