Join two files using regular expressions in Unix (preferably with perl)

I have the following two files, disconnect.txt and answered.txt:

disconnect.txt

2011-07-08 00:59:06,363 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:459 - AnalyzedInfo had ActCode = Disconnected from: 40397400012 to:40397400032
2011-07-08 00:59:06,363 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:459 - AnalyzedInfo had ActCode = Disconnected from: 4035350012 to:40677400032

answered.txt

2011-07-08 00:59:40,706 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:404 - Normal Call Answered, billingid=2301986 from: 40397643433 to:403###34**
2011-07-08 00:59:40,706 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:404 - Normal Call Answered, billingid=2301986 from: 3455334459 to:1222
2011-07-08 00:59:48,893 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:404 - Normal Call Answered, billingid=2220158 from: 4035350012 to:40677400032

I would like to create a join on these files based on the from: and to: fields, and the output should be the matching lines from answered.txt. For example, for the two files above, the output would be:

2011-07-08 00:59:48,893 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:404 - Normal Call Answered, billingid=2220158 from: 4035350012 to:40677400032

I'm currently doing it by comparing each line in file 1 with each line in file 2, but would like to know whether a more efficient way exists (these files will run to tens of gigabytes).

Thank you

3 Answers

烂人 2024-11-25 19:16:10

Sounds like you have hundreds of millions of lines?

Unless the files are sorted in such a way that you can expect the order of the from: and to: to at least vaguely correlate, this is a job for a database.
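For illustration, here is a minimal sketch of that database route (the calls.db file, table and column names are all hypothetical, and it assumes Perl's DBI with DBD::SQLite installed; the from:/to: regex is inferred from the sample lines):

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Hypothetical schema: one table per log file, keyed on the from/to pair.
my $dbh = DBI->connect("dbi:SQLite:dbname=calls.db", "", "",
                       { RaiseError => 1, AutoCommit => 0 });
$dbh->do("CREATE TABLE disconnect (fromno TEXT, tono TEXT)");
$dbh->do("CREATE TABLE answered   (fromno TEXT, tono TEXT, line TEXT)");
$dbh->do("CREATE INDEX disc_key ON disconnect (fromno, tono)");

# Load one file, extracting the from:/to: values from each line.
sub load {
    my ($file, $sth, $keep_line) = @_;
    open my $fh, '<', $file or die "$file: $!";
    while (my $line = <$fh>) {
        next unless $line =~ /from:\s*(\S+)\s+to:\s*(\S+)/;
        $sth->execute($1, $2, $keep_line ? $line : ());
    }
    close $fh;
}

load('disconnect.txt', $dbh->prepare("INSERT INTO disconnect VALUES (?, ?)"), 0);
load('answered.txt',   $dbh->prepare("INSERT INTO answered VALUES (?, ?, ?)"), 1);
$dbh->commit;

# The join: answered lines whose from/to pair also appears in disconnect.
my $sth = $dbh->prepare(q{
    SELECT line FROM answered a
    WHERE EXISTS (SELECT 1 FROM disconnect d
                  WHERE d.fromno = a.fromno AND d.tono = a.tono)
});
$sth->execute;
while (my ($line) = $sth->fetchrow_array) { print $line }

The index on (fromno, tono) is what keeps the EXISTS lookup from degenerating into the same quadratic scan the asker is trying to escape.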

拥有 2024-11-25 19:16:10

If the files are large, the quadratic algorithm will take a lifetime.

Here is a Ruby script that uses just a single hash table lookup per line in answered.txt:

# Build the join key "from.to" for a line, e.g. "4035350012.40677400032".
def key s
  s.split('from:')[1].split('to:').map(&:strip).join('.')
end

# Pass 1: remember every from/to pair seen in disconnect.txt.
h = {}
open 'disconnect.txt', 'r' do |f|
  while s = f.gets
    h[key(s)] = true
  end
end

# Pass 2: print the answered.txt lines whose pair was also disconnected.
open 'answered.txt', 'r' do |f|
  while a = f.gets
    puts a if h[key(a)]
  end
end

Like ysth says, it all depends on the number of lines in disconnect.txt. If that's a really big¹ number, then you will probably not be able to fit all the keys in memory and you will need a database.


¹ The number of lines in disconnect.txt multiplied by (roughly) 64 should be less than the amount of memory in your machine.
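Since the question asks for Perl in particular, here is a minimal Perl sketch of the same single-pass hash join (the regex for the from:/to: fields is an assumption based on the sample lines; lines without a pair are skipped):

#!/usr/bin/perl
use strict;
use warnings;

# Build a "from.to" join key, or undef if the line has no from:/to: pair.
sub key {
    my ($line) = @_;
    return $line =~ /from:\s*(\S+)\s+to:\s*(\S+)/ ? "$1.$2" : undef;
}

# Pass 1: remember every from/to pair seen in disconnect.txt.
my %seen;
open my $dfh, '<', 'disconnect.txt' or die "disconnect.txt: $!";
while (my $line = <$dfh>) {
    my $k = key($line);
    $seen{$k} = 1 if defined $k;
}
close $dfh;

# Pass 2: print the answered.txt lines whose pair was also disconnected.
open my $afh, '<', 'answered.txt' or die "answered.txt: $!";
while (my $line = <$afh>) {
    my $k = key($line);
    print $line if defined $k and $seen{$k};
}
close $afh;

The same memory caveat applies: %seen holds one entry per distinct from/to pair in disconnect.txt.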

迟月 2024-11-25 19:16:10

First, sort the files on the from/to timestamps if they are not already sorted that way. (Yes, I know the from/to appear to be stored as epoch seconds, but that's still a timestamp.)

Then take the sorted files and compare the first lines of each.

  • If the timestamps are the same, you have a match. Hooray! Advance a line in one or both files (depending on your rules for duplicate timestamps in each) and compare again.
  • If not, grab the next line in whichever file has the earlier timestamp and compare again.

This is the fastest way to compare two (or more) sorted files and it guarantees that no line will be read from disk more than once.
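A minimal Perl sketch of that merge, assuming both inputs have already been sorted by their from/to key (the disconnect.sorted and answered.sorted names are placeholders) and that every line carries a from:/to: pair:

#!/usr/bin/perl
use strict;
use warnings;

# Join key for a log line; this sketch assumes every line has one.
sub key {
    my ($line) = @_;
    $line =~ /from:\s*(\S+)\s+to:\s*(\S+)/ or die "no key in: $line";
    return "$1.$2";
}

open my $dfh, '<', 'disconnect.sorted' or die "disconnect.sorted: $!";
open my $afh, '<', 'answered.sorted'   or die "answered.sorted: $!";

my $dline = <$dfh>;
my $aline = <$afh>;
while (defined $dline && defined $aline) {
    my $cmp = key($dline) cmp key($aline);
    if    ($cmp < 0) { $dline = <$dfh> }                # disconnect side is behind: advance it
    elsif ($cmp > 0) { $aline = <$afh> }                # answered side is behind: advance it
    else             { print $aline; $aline = <$afh> }  # match: emit the answered line
}

Because each file handle only ever moves forward, memory use stays constant no matter how large the files are.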

If your files aren't appropriately sorted, then the initial sorting operation may be somewhat expensive on files in the "tens of gigabytes each" size range (one way to do that sort is sketched after this list), but:

  1. You can split the files into arbitrarily sized chunks (ideally small enough for each chunk to fit into memory), sort each chunk independently, and then generalize the above algorithm from two files to as many as are necessary.
  2. Even if you don't do that, and you deal with the disk thrashing involved in sorting files larger than the available memory, sorting and then doing a single pass over each file will still be a lot faster than any solution involving a Cartesian join.
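As for producing the sorted inputs used in the merge sketch above: GNU sort already performs an external merge sort (spilling to temporary files), so it copes with files larger than memory on its own. One hedged way to sort each file by its from/to key is a decorate-sort-undecorate pipeline run from the shell (file names are illustrative; LC_ALL=C keeps sort's byte-wise ordering consistent with Perl's cmp):

perl -ne 'print "$1.$2\t$_" if /from:\s*(\S+)\s+to:\s*(\S+)/' disconnect.txt \
    | LC_ALL=C sort | cut -f2- > disconnect.sorted
perl -ne 'print "$1.$2\t$_" if /from:\s*(\S+)\s+to:\s*(\S+)/' answered.txt \
    | LC_ALL=C sort | cut -f2- > answered.sorted

Each line is prefixed with its key and a tab, the whole file is sorted on that prefix, and cut then strips the prefix back off.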

Or you could just use a database as mentioned in previous answers. The above method will be more efficient in most, if not all, cases, but a database-based solution would be easier to write and would also provide a lot of flexibility for analyzing your data in other ways without needing to do a complete scan through each file every time you need to access anything in it.
