Pandas: loading and aggregating large data

Posted on 2025-02-12 05:15:18

I have a data frame with roughly 4 to 6 million rows, which takes a lot of memory to load. It would be fine if I could just read and process the rows one at a time, but the problem is that I need to aggregate the data (for example, a sum or an average).

Maybe I could just assign more memory to my worker, but I don't know how large the data will grow in the future.

So my first thought is to fetch the data from the database in chunks, aggregate each chunk, and then combine the results:

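# Pseudo code
aggregate_results = []
for i in range(number_of_chunks):
    # Fetch one chunk from the database and preprocess it in pandas
    data = preprocess(get_data(i))
    # Reduce the chunk to a partial aggregate (e.g. sum and count)
    aggregate_results.append(aggregate(data))
# Merge the partial aggregates into the final result
final_result = combine(aggregate_results)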
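
A minimal sketch of this chunk-and-combine idea could look like the following. The column names `key` and `value`, the grouping, and the sample chunks are assumptions for illustration only; the functions `aggregate` and `combine` simply mirror the names in the pseudo code. The point it shows is that an average cannot be combined by averaging per-chunk averages, so each chunk keeps a partial sum and a row count instead.

```python
from typing import Iterable

import pandas as pd


def aggregate(chunk: pd.DataFrame) -> pd.DataFrame:
    """Reduce one chunk to per-group partial sums and row counts."""
    return chunk.groupby("key")["value"].agg(["sum", "count"])


def combine(partials: Iterable[pd.DataFrame]) -> pd.DataFrame:
    """Merge per-chunk partials into exact totals and means."""
    # Stack all partial results, then add up sums and counts per group.
    total = pd.concat(list(partials)).groupby(level=0).sum()
    # The overall mean is total sum divided by total count.
    total["mean"] = total["sum"] / total["count"]
    return total


# Hypothetical data, already split into two chunks.
chunks = [
    pd.DataFrame({"key": ["a", "a", "b"], "value": [1.0, 2.0, 10.0]}),
    pd.DataFrame({"key": ["a", "b", "b"], "value": [3.0, 20.0, 30.0]}),
]

final_result = combine(aggregate(chunk) for chunk in chunks)
print(final_result)
```

Only the small partial results are held in memory at the end, so peak memory is governed by the chunk size rather than by the total number of rows.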

Then it occurred to me that there is probably an established solution for this kind of work already.

Are there any other options I could take? I'm using AWS Redshift to store the data and Apache Airflow to schedule the tasks, but I have no experience with big data tools like Spark or Hadoop.

I used to aggregate the data with SQL, but now I'm using pandas because I need to preprocess the data before the aggregation. So aggregating directly in the database is not an option.
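
For reference, pandas can already pull a query result in pieces: `pd.read_sql` accepts a `chunksize` argument and then returns an iterator of data frames. The sketch below uses an in-memory SQLite table as a stand-in for Redshift, and the table name `events`, the column `amount`, and the filter inside `preprocess` are all made up for illustration. Whether client memory really stays bounded against a real database depends on the driver and cursor type, so treat this as a sketch under those assumptions.

```python
import sqlite3

import pandas as pd

# Stand-in database (assumption): an in-memory SQLite table instead of Redshift.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (amount REAL)")
con.executemany(
    "INSERT INTO events VALUES (?)", [(float(i),) for i in range(1, 11)]
)
con.commit()


def preprocess(chunk: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical preprocessing step: keep only rows with amount > 2."""
    return chunk[chunk["amount"] > 2]


running_sum = 0.0
running_count = 0

# With chunksize set, read_sql yields DataFrames of at most 4 rows each.
for chunk in pd.read_sql("SELECT amount FROM events", con, chunksize=4):
    data = preprocess(chunk)
    running_sum += data["amount"].sum()
    running_count += len(data)

# Combine the running totals into the final average.
average = running_sum / running_count
con.close()
```

Keeping running totals like this works for sums, counts, and averages; aggregates such as medians do not decompose this way and would need a different approach.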

Any help would be appreciated.
