Retrieving data from multiple parquet files into one dataframe (Python)
I want to start by saying this is the first time I've worked with Parquet files. I have a list of 2615 parquet files that I downloaded from an S3 bucket, and I want to read them into one dataframe. They follow the same folder structure; I am putting an example below:
/Forecasting/as_of_date=2022-02-01/type=full/export_country=Spain/import_country=France/000.parquet
The file name 000.parquet is always the same, irrespective of folder.
I saved all of the file locations using the following function:
import os

def list_files(dir):
    r = []
    for root, dirs, files in os.walk(dir):
        for name in files:
            r.append(os.path.join(root, name))
    return r
This generates a list of all file locations, exactly like in the folder example above.
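For reference, the data_files list used in the snippets below is assumed to come from a call like this, where the Forecasting directory name is taken from the example path above:

data_files = list_files('Forecasting')   # ~2615 paths, each ending in 000.parquet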
The next thing I tried was using Dask to read all of the parquet files into a Dask dataframe, but it doesn't seem to work.
import dask.dataframe as dd
dask_df = dd.read_parquet(data_files)
I keep getting this error, and although I understand where the issue is, I'm not sure how to fix it. It's because the files contain the columns export_country and import_country, which are also partition columns:
ValueError: No partition-columns should be written in the
file unless they are ALL written in the file.
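As a quick sanity check (a sketch assuming pyarrow is installed and data_files is the list from above), the schema of a single file shows whether the partition columns are also stored inside the file:

import pyarrow.parquet as pq

# If export_country and import_country show up in the file schema as well
# as in the directory names, the partition columns are written inside the
# files, which is what the ValueError above complains about.
print(pq.read_schema(data_files[0]).names)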
Another solution I tried was iterating through each parquet file using pandas and combining everything into one dataframe.
import pandas as pd

df = pd.DataFrame()
for f in data_files:
    data = pd.read_parquet(f, engine='pyarrow')
    df = df.append(data)
This seems to take ages, and my kernel dies because it runs out of RAM.
3 Answers
It's faster to do a single concat compared to append multiple times:
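A minimal sketch of that single-concat approach, assuming data_files is the list built earlier in the question:

import pandas as pd

# Read every file into its own frame, then concatenate them all in one
# call instead of appending inside the loop.
df = pd.concat(
    (pd.read_parquet(f, engine='pyarrow') for f in data_files),
    ignore_index=True,
)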
but I doubt it helps with the memory limitation.
A variation of @Learning is a mess's answer, but using dd.concat:
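A sketch of that variation, again assuming data_files is the list from the question; each file is read lazily and the pieces are concatenated with dd.concat:

import dask.dataframe as dd

# Build one lazy Dask dataframe per file and concatenate them; nothing is
# loaded into memory until compute() is called.
dask_df = dd.concat([dd.read_parquet(f) for f in data_files])
df = dask_df.compute()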
In Pandas 1.3+, you can just read the folder directly, and pandas will concat them for you:
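A sketch of that, where the Forecasting directory name is an assumption taken from the example path in the question:

import pandas as pd

# Point read_parquet at the root of the partitioned dataset; with the
# pyarrow engine the nested partition folders are read into a single
# dataframe.
df = pd.read_parquet('Forecasting', engine='pyarrow')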