Best way to process large amounts of CSV data quickly

Posted on 2024-10-30 21:21:51

I have large CSV datasets (10M+ lines) that need to be processed. I have two other files that need to be referenced for the output—they contain data that amplifies what we know about the millions of lines in the CSV file. The goal is to output a new CSV file that has each record merged with the additional information from the other files.

Imagine that the large CSV file has transactions but the customer information and billing information is recorded in two other files and we want to output a new CSV that has each transaction linked to the customer ID and account ID, etc.

A colleague has a functional program written in Java to do this but it is very slow. The reason is that the CSV file with the millions of lines has to be walked through many, many, many times apparently.

My question is—yes, I am getting to it—how should I approach this in Ruby? The goal is for it to be faster (it currently takes 18+ hours with very little CPU activity).

Can I load this many records into memory? If so, how should I do it?

I know this is a little vague. Just looking for ideas as this is a little new to me.

Comments (6)

日久见人心 2024-11-06 21:21:51

Here is some Ruby code I wrote to process large CSV files (~180 MB in my case).

https://gist.github.com/1323865

A standard FasterCSV.parse pulling it all into memory was taking over an hour. This got it down to about 10 minutes.

The relevant part is this:

require 'fastercsv'  # on Ruby 1.9+ the bundled CSV library replaces FasterCSV

lines = []
IO.foreach('/tmp/zendesk_tickets.csv') do |line|
  lines << line
  # Parse and store in batches of 1000 lines so the whole file never sits in memory.
  if lines.size >= 1000
    lines = FasterCSV.parse(lines.join) rescue next
    store lines  # `store` (defined in the linked gist) bulk-inserts the parsed rows
    lines = []
  end
end
store lines  # flush the final partial batch

IO.foreach doesn't load the entire file into memory; it just steps through it with a buffer. When it reaches 1000 lines, it tries to parse them as CSV and insert just those rows. One tricky part is the "rescue next". If your CSV has fields that span multiple lines, you may need to grab a few more lines to get a valid, parseable CSV string. Otherwise, the line you're on could be in the middle of a field.

In the gist you can see one other nice optimization, which uses MySQL's INSERT ... ON DUPLICATE KEY UPDATE. This allows you to insert in bulk, and if a duplicate key is detected it simply overwrites the values in that row instead of inserting a new row. You can think of it as a create/update in one query. You'll need a unique index on at least one column for this to work.
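
For illustration, a minimal sketch of that bulk-upsert pattern (not the gist's actual code; it assumes the mysql2 gem, a hypothetical tickets table with a unique index on id, and sample data standing in for one parsed batch from the loop above):

require 'mysql2'  # assumed gem; connection details are placeholders

client = Mysql2::Client.new(host: 'localhost', username: 'app',
                            password: 'secret', database: 'reports')

# Stand-in for one batch of rows parsed by FasterCSV above.
rows = [[1, 'Refund request', 'open'], [2, 'Login issue', 'closed']]

# Build a single multi-row VALUES list, escaping every field.
values = rows.map do |r|
  '(' + r.map { |v| "'#{client.escape(v.to_s)}'" }.join(', ') + ')'
end.join(', ')

# One round trip inserts the whole batch; rows with a duplicate id are updated in place.
client.query("INSERT INTO tickets (id, subject, status) VALUES #{values} " \
             "ON DUPLICATE KEY UPDATE subject = VALUES(subject), status = VALUES(status)")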

把昨日还给我 2024-11-06 21:21:51

How about using a database?

Jam the records into tables, and then query them out using joins.

The import might take a while, but the DB engine will be optimized for the join and retrieval part...
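
A rough sketch of that approach, assuming the sqlite3 gem; the table, column, and file names are invented for the example:

require 'csv'
require 'sqlite3'  # assumed gem

db = SQLite3::Database.new('merge.db')
db.execute('CREATE TABLE IF NOT EXISTS customers (customer_id TEXT PRIMARY KEY, name TEXT)')
db.execute('CREATE TABLE IF NOT EXISTS transactions (txn_id TEXT, customer_id TEXT, amount TEXT)')

# Import both files inside one transaction so SQLite does not sync after every row.
db.transaction do
  CSV.foreach('customers.csv', headers: true) do |r|
    db.execute('INSERT INTO customers VALUES (?, ?)', [r['customer_id'], r['name']])
  end
  CSV.foreach('transactions.csv', headers: true) do |r|
    db.execute('INSERT INTO transactions VALUES (?, ?, ?)',
               [r['txn_id'], r['customer_id'], r['amount']])
  end
end

# The merge itself is a single join, streamed straight into the output file.
sql = 'SELECT t.txn_id, t.amount, c.customer_id, c.name ' \
      'FROM transactions t JOIN customers c ON c.customer_id = t.customer_id'
CSV.open('merged.csv', 'w') do |out|
  out << %w[txn_id amount customer_id name]
  db.execute(sql) { |row| out << row }
end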

守不住的情 2024-11-06 21:21:51

10M+ rows doesn't really sound like that much. If you can preload the contents of the files and match up the data in memory with decent data structures (you'll want maps at some point), you won't have to keep running through the CSV files over and over. File access is SLOW.
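
In Ruby that might look something like this sketch (file and column names are made up): load the two small reference files into hashes keyed by customer ID, then make a single pass over the big file.

require 'csv'

# Preload the small reference files; each lookup is then O(1) instead of a file scan.
customers = {}
CSV.foreach('customers.csv', headers: true) { |r| customers[r['customer_id']] = r }

billing = {}
CSV.foreach('billing.csv', headers: true) { |r| billing[r['customer_id']] = r }

# Single pass over the 10M-row transactions file, writing merged rows as we go.
CSV.open('merged.csv', 'w') do |out|
  out << %w[txn_id amount customer_id customer_name account_id]
  CSV.foreach('transactions.csv', headers: true) do |t|
    c = customers[t['customer_id']]
    b = billing[t['customer_id']]
    out << [t['txn_id'], t['amount'], t['customer_id'],
            c && c['name'], b && b['account_id']]
  end
end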

带上头具痛哭 2024-11-06 21:21:51

Two reasonably fast options:

  1. Put your data into an SQLite DB. Then it's a simple query with a pair of joins that will perform way faster than anything you could write yourself -- SQL is very good for this kind of task.

  2. Assuming your additional CSV files are small enough to fit into RAM, you can read everything into a hash, using customer ID as the key, then look up that hash while processing the main file with its 10M+ records. Note that only the lookup data needs to go into RAM; the main list can be processed in small batches (see the sketch after this list).
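
A short sketch of the second option's batching idea, with hypothetical file and column names; only the small lookup table lives in RAM while the big file is consumed in fixed-size slices:

require 'csv'

# Only the small lookup file is held in memory.
customers = {}
CSV.foreach('customers.csv', headers: true) { |r| customers[r['customer_id']] = r }

CSV.open('merged.csv', 'w') do |out|
  out << %w[txn_id amount customer_id account_id]
  # CSV.foreach without a block returns an Enumerator, so the big file can be
  # pulled in 10,000-row batches rather than loaded whole.
  CSV.foreach('transactions.csv', headers: true).each_slice(10_000) do |batch|
    batch.each do |t|
      c = customers[t['customer_id']]
      out << [t['txn_id'], t['amount'], t['customer_id'], c && c['account_id']]
    end
  end
end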

若言繁花未落 2024-11-06 21:21:51

My experience is that with Ruby, you should be prepared for memory usage of about 10x the actual payload. Of course, with current amounts of RAM, if the process loads only one file at a time, 10 MB is almost negligible even when multiplied by ten :)

If you can read one line at a time (which is easy with File instances), you could use FasterCSV and write one line at a time as well. That would make memory consumption O(1) instead of O(n). But with 10-megabyte files you could probably slurp the whole file into memory and write it out as CSV in one pass, given only a few such processes at any given time.
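
A constant-memory version might look like this sketch, using Ruby 1.9's built-in CSV (the successor to FasterCSV); the file names and the enrichment step are placeholders:

require 'csv'

# Read one row, transform it, write one row: memory use stays O(1)
# no matter how large the input file is.
CSV.open('enriched.csv', 'w') do |out|
  CSV.foreach('transactions.csv', headers: true) do |row|
    # The enrichment is a placeholder; this is where the lookup/merge would happen.
    out << [row['txn_id'], row['amount'], row['customer_id']]
  end
end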

殤城〤 2024-11-06 21:21:51

If you have a Java program written, make sure you use the NIO libraries. They are way faster than the default. I have processed text files with 500,000 lines using the NIO libraries before.
