Reading a parquet file in Dask returns an empty dataframe

Posted 2025-02-12 17:48:26

I'm trying to replicate this datashader+parquet+dask+dash example in order to do something similar with my own data. Here's the git code.

The steps to replicate include running the Jupyter notebook to convert the 4 GB CSV file into a parquet file. I can run this code without issue, and it creates a parquet directory containing many files of roughly 70 MB each, but when I try to read the parquet file back it returns an empty dataframe (though with the correct columns). So, after reading the CSV into a Dask dataframe and doing some processing, I can check the head():

ddf.head()
radio   mcc net area    cell    unit    lon lat range   samples changeable  created updated averageSignal   x_3857  y_3857
0   UMTS    262 2   801 86355   0   13.285512   52.522202   1000    7   1   1282569574000000000 1300155341000000000 0   1.478936e+06    6.895103e+06
1   GSM 262 2   801 1795    0   13.276907   52.525714   5716    9   1   1282569574000000000 1300155341000000000 0   1.477979e+06    6.895745e+06
2   GSM 262 2   801 1794    0   13.285064   52.524000   6280    13  1   1282569574000000000 1300796207000000000 0   1.478887e+06    6.895432e+06
3   UMTS    262 2   801 211250  0   13.285446   52.521744   1000    3   1   1282569574000000000 1299466955000000000 0   1.478929e+06    6.895019e+06
4   UMTS    262 2   801 86353   0   13.293457   52.521515   1000    2   1   1282569574000000000 1291380444000000000 0   1.479821e+06    6.894977e+06
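
For context, the conversion step that produces the x_3857/y_3857 columns shown above presumably looks something like the following sketch. The file name and the use of the lnglat_to_meters helper are my assumptions (the helper does ship with datashader), not necessarily the notebook's exact code:

import dask.dataframe as dd
from datashader.utils import lnglat_to_meters

# Hypothetical file name for the OpenCellID CSV export
ddf = dd.read_csv('./data/cell_towers.csv')

# Project lon/lat (EPSG:4326) into Web Mercator metres (EPSG:3857);
# this is where the x_3857/y_3857 columns come from
ddf['x_3857'], ddf['y_3857'] = lnglat_to_meters(ddf['lon'], ddf['lat'])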

I then write it to parquet:

# Write parquet files to the ./data directory
import os

os.makedirs('./data', exist_ok=True)
parquet_path = './data/cell_towers.parq'
ddf.to_parquet(parquet_path,
               compression='snappy',
               write_metadata_file=True)

and attempt to read from parquet:

ddy = dd.read_parquet('./data/cell_towers.parq')

but it returns an empty dataframe, albeit with the correct column names:

ddy.head(3)
> radio mcc net area    cell    unit    lon lat range   samples changeable  created updated averageSignal   x_3857  y_3857

len(ddy)
> 0 

This is the first time I've used Dask dataframes and parquet; it seems like it should just work, but there may be some basic concept I'm missing here.
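
A couple of quick diagnostics that might help narrow this down (my suggestion, not from the original post; assumes fastparquet is installed):

# Force a specific engine and inspect what Dask discovered
ddy = dd.read_parquet('./data/cell_towers.parq', engine='fastparquet')
print(ddy.npartitions)                     # partitions found in the dataset
print(ddy.map_partitions(len).compute())   # row count per partition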

Small reproducible code snippet:

import pandas as pd
import dask.dataframe as dd

ddfx = dd.from_pandas(pd.DataFrame(range(10), columns=['A']), npartitions=2)
parquet_path = './dummy.parq'
ddfx.to_parquet(parquet_path,
                compression='snappy',
                write_metadata_file=True)
ddfy = dd.read_parquet('./dummy.parq')
print('Input DDF length: {0}. Output DDF length: {1}'.format(len(ddfx), len(ddfy)))

Input DDF length: 10. Output DDF length: 0

How can I write a DDF to parquet and then read it back?

Comments (1)

暮色兮凉城 2025-02-19 17:48:26

I am unable to reproduce the error using dask=2022.05.2. There might be some version incompatibility, so I'd recommend installing dask, pandas and fastparquet in a dedicated environment.
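
As a quick sanity check in the new environment (my addition, not part of the original answer), you can print the versions that are actually being imported:

import dask
import pandas
import fastparquet

print(dask.__version__, pandas.__version__, fastparquet.__version__)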
