如何在 Perl 中执行类似 SQL 的 Join?

发布于 2024-12-24 02:48:32 字数 461 浏览 2 评论 0原文

我必须通过组合两个不同的文件来处理一些数据。它们都有两列,它们将形成一个主键,我可以用它来并排匹配它们。问题中的文件很大(大约 5GB,有 2000 万行),所以我需要一个高效的代码。我该如何在 Perl 中做到这一点?

我举一个例子:

如果文件 A 包含列

id, name, lastname, dob, school

文件 B 包含列,

address, id, postcode, dob, email

我需要通过匹配两个文件中的 iddob 来连接这两个文件以获得输出文件将会有以下列:

 id, name, lastname, dob, school, address, postcode, email

I have to process some data by combining two different files. Both of them have two columns that would form a primary key that I can use to match them side-by-side. The files in questions are huge (around 5GB with 20 million rows) so I would need an efficient code. How would I do this in Perl?

I give an example:

If File A contains columns

id, name, lastname, dob, school

File B contains columns

address, id, postcode, dob, email

I would need to join these two files by matching id and dob in the two files to have an output file that would have the columns:

 id, name, lastname, dob, school, address, postcode, email

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(6

唱一曲作罢 2024-12-31 02:48:32

我想我只是创建一个新的 mysql/sqlite/任何数据库并插入行。应该是 ~20 行 Perl。

当然,这需要轻松访问数据库。

我猜您还可以按感兴趣的字段对文件进行排序,然后对于 file1 中的每一行查找并打印 file2 中的匹配行。

Think I would just create a new mysql/sqlite/whatever DB and insert the rows. Should be ~20 lines of perl.

This, of course, requires easy access to a DB..

Guess you could also sort the files by the interesting fields and then for each line in file1 find and print the matching lines in file2.

往日 2024-12-31 02:48:32

执行此操作的老式方法是使用系统实用程序按键顺序对两个文件进行排序,然后逐行匹配它们。读取两个文件,如果键匹配则输出数据。如果它们不匹配,请使用较小的密钥读取文件,直到它们匹配。如果文件达到 eof,则将文件的密钥设置为无限高。当两个键都无限高时,你就完成了。

The old fashioned way to do this is to use system utilities to sort both files in key sequence and then match them line by line. Read both files, if the keys match output the data. If they don't match, read the file with the lesser key until they do match. Set the key infinitely high for a file if it hits eof. When both keys are infinitely high, you're done.

怀里藏娇 2024-12-31 02:48:32

或者,仔细阅读这篇精彩的 Techrepublic 文章- 不过,您仍然可能需要 5G 内存。我想知道使用 unix/linux CLI 排序/连接实用程序会带您去哪里,提高效率。只是一个想法。

Or, peruse this nice Techrepublic article - you are still liable to need 5G of memory, though. I wonder where using the unix/linux CLI sort/join utilities would take you, efficiencywise. Just a thought.

┼── 2024-12-31 02:48:32

我实际上没有尝试过这个,但更具创意的解决方案可能是:

  1. 读取每个文件一次,并在唯一的 id+dob 组合及其在文件中的位置之间创建一个映射。使用tell()
  2. 在 perl 中创建映射
  3. 使用映射中的位置和 sysread()< 从文件中读取实际数据/a>
  4. 将数据写入新文件

I haven't actually tried this, but a more creative solution could be:

  1. Read each file once and create a map between the unique id+dob combinations and their positions in the file. Use tell().
  2. Create the map in perl
  3. Read the actual data from the files using the positions int he map and sysread()
  4. Write the data to a new file
岁吢 2024-12-31 02:48:32

您还可以使用我已经使用了 3 年的 CPAN 模块 Set::Relation,它旨在执行类似的操作,让您可以执行所有 SQL 功能,例如 Perl 中的 join。为每个文件创建一个 Set::Relation 对象,然后使用 join() 方法。也就是说,该模块的实现会将所有操作数和结果保留在内存中,因此它受到 RAM 的限制。但是您仍然可以查看其源代码以了解 join() 的工作原理,然后基于它实现更有效的版本以达到您的目的。

You can also use my 3-year-old CPAN module Set::Relation which is designed to do things like this, letting you do all the SQL features such as join in Perl. Create a Set::Relation object for each file and then use the join() method. That said, this module as-implemented will keep all your operands and result in memory, so it is limited by your RAM. But you can still look at its source for how join() works and then implement a more efficient version for your purposes based on it.

林空鹿饮溪 2024-12-31 02:48:32

另外,您可以尝试 DBD::AnyData

Also, you can try DBD::AnyData

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文