I was trying to save a pandas DataFrame to Parquet format using PyArrow v2 and ran into a weird problem. The (simplified) DataFrame has one string column and one nested column (a list of dicts). Here is an example:
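import random
import pandas as pd

labels = ["aa", "bb", "cc", "dd"]
vals = [random.choice(labels) for _ in range(2000)]
df = pd.DataFrame({"names": vals})
# Wrap each name in a one-element list of dicts to build the nested column
df["name_nested"] = df.names.apply(lambda x: [{"label": x}])
df.to_parquet("x.par")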
As you can see, the columns "names" and "name_nested" should have the same values:
df.head()
names name_nested
0 bb [{'label': 'bb'}]
1 aa [{'label': 'aa'}]
2 cc [{'label': 'cc'}]
3 cc [{'label': 'cc'}]
4 cc [{'label': 'cc'}]
However, once I reload the saved Parquet file from disk, things get weird and I no longer get the same result:
df2 = pd.read_parquet("x.par")
df2["name_nested2"] = df2.name_nested.apply(lambda x: x[0]["label"])
len(df2[df2.name_nested2 != df2.names])
# 726
Out of the 2000 entries, 726 do not match. Here is an example:
df2[df2.name_nested2 != df2.names]
names name_nested name_nested2
1025 dd [{'label': 'cc'}] cc
1027 bb [{'label': 'aa'}] aa
1029 aa [{'label': 'cc'}] cc
1031 dd [{'label': 'aa'}] aa
1035 bb [{'label': 'dd'}] dd
As you can see, the column name_nested is no longer the same as names! This is very wrong behavior. I also noticed that this happens only if the DataFrame has more than 1024 rows, and that the mismatches occur only after row 1024.
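To pin down where the mismatches start and end, a small helper along these lines may be useful. It is only a sketch: the name mismatch_summary is made up for illustration, and it assumes every nested cell is a non-empty sequence whose first element is a dict with a "label" key, as in the example above.

```python
def mismatch_summary(names, nested):
    """Compare each flat name with the 'label' of the first dict in the nested cell.

    Returns the number of mismatching rows and the first and last row
    positions where the two columns disagree (None if they all agree).
    """
    bad = [
        i
        for i, (name, cell) in enumerate(zip(names, nested))
        if cell[0]["label"] != name
    ]
    return {
        "count": len(bad),
        "first": bad[0] if bad else None,
        "last": bad[-1] if bad else None,
    }
```

With the reloaded frame, it could be called as mismatch_summary(df2["names"].tolist(), df2["name_nested"].tolist()); a "first" value above 1024 would match the observation that only later rows are affected.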
I thought this would be a known issue, but I couldn't find any information about it.
Once I upgraded to PyArrow 6, this was no longer an issue, but I would like to understand the root cause, in case anyone has seen this before.
Comments (1)
PyArrow version 2.0.0 is pretty old, and many improvements have been made to the Parquet support since then. Consider that PyArrow is currently at version 8.0.0. I suggest you upgrade to the latest version to benefit from all the recent work.
Your specific issue might be related to https://issues.apache.org/jira/browse/ARROW-11607, which was addressed in version 4.0.0.
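If upgrading everywhere is not immediately possible, one option is to check the installed version before writing nested columns. The sketch below assumes the fix really did land in 4.0.0, as suggested above; version_tuple and is_at_least are hypothetical helper names, not part of PyArrow.

```python
import re


def version_tuple(version):
    """Turn '4.0.0' or '4.0.0.dev12' into its leading integers, e.g. (4, 0, 0)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break  # stop at the first non-numeric component such as 'dev12'
        parts.append(int(match.group()))
    return tuple(parts)


def is_at_least(installed, minimum="4.0.0"):
    """Return True if the installed version is not older than the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)
```

In practice the installed version can be read from pyarrow.__version__, so is_at_least(pyarrow.__version__) could gate the call to df.to_parquet.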