Retrieving data from multiple parquet files into one dataframe (Python)
I want to start by saying this is the first time I've worked with Parquet files. I have a list of 2615 parquet files that I downloaded from an S3 bucket, and I want to read them into one dataframe. They follow the same folder structure; I am putting an example below:
/Forecasting/as_of_date=2022-02-01/type=full/export_country=Spain/import_country=France/000.parquet
The file name 000.parquet is always the same, irrespective of folder.
I saved all of the file locations using the following function:
import os

def list_files(dir):
    r = []
    for root, dirs, files in os.walk(dir):
        for name in files:
            r.append(os.path.join(root, name))
    return r
This generates a list of all file locations, exactly like in the folder example above.
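For reference, the data_files list used in the snippets below is assumed to come from a call like this, where the Forecasting directory name is taken from the example path above:

data_files = list_files('Forecasting')   # ~2615 paths, each ending in 000.parquet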
The next thing I tried was using Dask to read all of the parquet files into a Dask dataframe, but it doesn't seem to work.
import dask.dataframe as dd
dask_df = dd.read_parquet(data_files)
I keep getting this error, and although I understand where the issue is, I'm not sure how to fix it. It's because the files contain the columns export_country and import_country, which are also partition columns:
ValueError: No partition-columns should be written in the
file unless they are ALL written in the file.
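As a quick sanity check (a sketch assuming pyarrow is installed and data_files is the list from above), the schema of a single file shows whether the partition columns are also stored inside the file:

import pyarrow.parquet as pq

# If export_country and import_country show up in the file schema as well
# as in the directory names, the partition columns are written inside the
# files, which is what the ValueError above complains about.
print(pq.read_schema(data_files[0]).names)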
Another solution I tried was iterating through each parquet file using pandas and combining everything into one dataframe.
import pandas as pd

df = pd.DataFrame()
for f in data_files:
    data = pd.read_parquet(f, engine='pyarrow')
    df = df.append(data)
This seems to take ages, and my kernel dies because it runs out of RAM.
3 Answers
It's faster to do a single concat compared to append multiple times:
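A minimal sketch of that single-concat approach, assuming data_files is the list built earlier in the question:

import pandas as pd

# Read every file into its own frame, then concatenate them all in one
# call instead of appending inside the loop.
df = pd.concat(
    (pd.read_parquet(f, engine='pyarrow') for f in data_files),
    ignore_index=True,
)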
but I doubt it helps with the memory limitation.
A variation of @Learning is a mess's answer, but using dd.concat:
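A sketch of that variation, again assuming data_files is the list from the question; each file is read lazily and the pieces are concatenated with dd.concat:

import dask.dataframe as dd

# Build one lazy Dask dataframe per file and concatenate them; nothing is
# loaded into memory until compute() is called.
dask_df = dd.concat([dd.read_parquet(f) for f in data_files])
df = dask_df.compute()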
In Pandas 1.3+, you can just read the folder directly, and pandas will concat them for you:
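A sketch of that, where the Forecasting directory name is an assumption taken from the example path in the question:

import pandas as pd

# Point read_parquet at the root of the partitioned dataset; with the
# pyarrow engine the nested partition folders are read into a single
# dataframe.
df = pd.read_parquet('Forecasting', engine='pyarrow')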