如何在 Perl 中执行类似 SQL 的 Join?
我必须通过组合两个不同的文件来处理一些数据。它们都有两列,它们将形成一个主键,我可以用它来并排匹配它们。问题中的文件很大(大约 5GB,有 2000 万行),所以我需要一个高效的代码。我该如何在 Perl 中做到这一点?
我举一个例子:
如果文件 A 包含列
id, name, lastname, dob, school
文件 B 包含列,
address, id, postcode, dob, email
我需要通过匹配两个文件中的 id 和 dob 来连接这两个文件以获得输出文件将会有以下列:
id, name, lastname, dob, school, address, postcode, email
I have to process some data by combining two different files. Both of them have two columns that would form a primary key that I can use to match them side-by-side. The files in questions are huge (around 5GB with 20 million rows) so I would need an efficient code. How would I do this in Perl?
I give an example:
If File A contains columns
id, name, lastname, dob, school
File B contains columns
address, id, postcode, dob, email
I would need to join these two files by matching id and dob in the two files to have an output file that would have the columns:
id, name, lastname, dob, school, address, postcode, email
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(6)
我想我只是创建一个新的 mysql/sqlite/任何数据库并插入行。应该是 ~20 行 Perl。
当然,这需要轻松访问数据库。
我猜您还可以按感兴趣的字段对文件进行排序,然后对于 file1 中的每一行查找并打印 file2 中的匹配行。
Think I would just create a new mysql/sqlite/whatever DB and insert the rows. Should be ~20 lines of perl.
This, of course, requires easy access to a DB..
Guess you could also sort the files by the interesting fields and then for each line in file1 find and print the matching lines in file2.
执行此操作的老式方法是使用系统实用程序按键顺序对两个文件进行排序,然后逐行匹配它们。读取两个文件,如果键匹配则输出数据。如果它们不匹配,请使用较小的密钥读取文件,直到它们匹配。如果文件达到 eof,则将文件的密钥设置为无限高。当两个键都无限高时,你就完成了。
The old fashioned way to do this is to use system utilities to sort both files in key sequence and then match them line by line. Read both files, if the keys match output the data. If they don't match, read the file with the lesser key until they do match. Set the key infinitely high for a file if it hits eof. When both keys are infinitely high, you're done.
或者,仔细阅读这篇精彩的 Techrepublic 文章- 不过,您仍然可能需要 5G 内存。我想知道使用 unix/linux CLI 排序/连接实用程序会带您去哪里,提高效率。只是一个想法。
Or, peruse this nice Techrepublic article - you are still liable to need 5G of memory, though. I wonder where using the unix/linux CLI sort/join utilities would take you, efficiencywise. Just a thought.
我实际上没有尝试过这个,但更具创意的解决方案可能是:
I haven't actually tried this, but a more creative solution could be:
您还可以使用我已经使用了 3 年的 CPAN 模块 Set::Relation,它旨在执行类似的操作,让您可以执行所有 SQL 功能,例如 Perl 中的 join。为每个文件创建一个 Set::Relation 对象,然后使用 join() 方法。也就是说,该模块的实现会将所有操作数和结果保留在内存中,因此它受到 RAM 的限制。但是您仍然可以查看其源代码以了解 join() 的工作原理,然后基于它实现更有效的版本以达到您的目的。
You can also use my 3-year-old CPAN module Set::Relation which is designed to do things like this, letting you do all the SQL features such as join in Perl. Create a Set::Relation object for each file and then use the join() method. That said, this module as-implemented will keep all your operands and result in memory, so it is limited by your RAM. But you can still look at its source for how join() works and then implement a more efficient version for your purposes based on it.
另外,您可以尝试 DBD::AnyData
Also, you can try DBD::AnyData