Reading back a Pandas DataFrame saved with PyArrow gives wrong values

Posted on 2025-02-02 19:06:04

I was trying to save a pandas DataFrame to Parquet format using PyArrow v2 and ran into a weird problem. The (simplified) DataFrame has one string column and one nested column (a list of dicts). Here is an example:

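import random

import pandas as pd

labels = ["aa", "bb", "cc", "dd"]
vals = [random.choice(labels) for _ in range(2000)]
df = pd.DataFrame({"names": vals})
# Nested column: each cell is a one-element list holding a dict with the same label.
df["name_nested"] = df.names.apply(lambda x: [{"label": x}])
df.to_parquet("x.par")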

As you can see, the columns "names" and "name_nested" should hold the same values:

df.head()

    names   name_nested
0   bb  [{'label': 'bb'}]
1   aa  [{'label': 'aa'}]
2   cc  [{'label': 'cc'}]
3   cc  [{'label': 'cc'}]
4   cc  [{'label': 'cc'}]

However, once I reload the saved Parquet file from disk, things get weird and I no longer get the same result:

df2 = pd.read_parquet("x.par")
df2["name_nested2"] = df2.name_nested.apply(lambda x: x[0]["label"])

len(df2[df2.name_nested2 != df2.names])

# 726

Out of the 2000 entries, 726 do not match. Here is an example:

df2[df2.name_nested2 != df2.names] 

    names   name_nested name_nested2
1025    dd  [{'label': 'cc'}]   cc
1027    bb  [{'label': 'aa'}]   aa
1029    aa  [{'label': 'cc'}]   cc
1031    dd  [{'label': 'aa'}]   aa
1035    bb  [{'label': 'dd'}]   dd

As you can see, the column name_nested no longer matches names! This is very wrong behavior. I also noticed that this happens only if the DataFrame has more than 1024 rows, and that the mismatches occur only after row 1024.
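To make the "only after row 1024" observation easy to check, here is a minimal sketch that lists the positions where the nested label disagrees with the plain column. The helper names `mismatch_positions` and `summarize` are hypothetical (they are not part of pandas or PyArrow), and the sample data is synthetic, not output from the affected file.

```python
def mismatch_positions(names, nested):
    """Return positions i where nested[i][0]["label"] differs from names[i]."""
    return [
        i
        for i, (name, cell) in enumerate(zip(names, nested))
        if cell[0]["label"] != name
    ]


def summarize(positions):
    """Return count, first and last mismatching position, or None if all rows match."""
    if not positions:
        return None
    return {"count": len(positions), "first": positions[0], "last": positions[-1]}


# Synthetic illustration: the first two rows agree, the last two do not.
names = ["aa", "bb", "cc", "dd"]
nested = [[{"label": "aa"}], [{"label": "bb"}], [{"label": "dd"}], [{"label": "aa"}]]
print(summarize(mismatch_positions(names, nested)))
```

With the reloaded frame from the question, the call would be `mismatch_positions(df2.names, df2.name_nested)`. The returned values are positions, which coincide with the index labels only for a default integer index.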

I thought this would be a known issue, but I couldn't find any information about it.

Once I upgraded to PyArrow 6, this was no longer an issue, but I would like to understand the root cause, in case anyone has seen this before.


Comments (1)

初与友歌 2025-02-09 19:06:04

PyArrow version 2.0.0 is pretty old, and many improvements have been made to the Parquet support since then. Consider that PyArrow is currently at version 8.0.0. I suggest you upgrade to the latest version to benefit from all the recent work.

Your specific issue might be related to https://issues.apache.org/jira/browse/ARROW-11607, which was addressed in version 4.0.0.
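A small guard can flag an environment that still runs an older PyArrow. The sketch below is only an illustration: the 4.0.0 threshold is taken from this answer, which itself says the issue "might be" related, so treat it as an assumption. The names `version_tuple`, `predates_fix`, and `installed_pyarrow_version` are hypothetical helpers, not PyArrow APIs.

```python
from importlib import metadata

# Assumption: threshold taken from the answer above, not independently verified.
FIXED_IN = (4, 0, 0)


def version_tuple(version):
    """Turn a version string such as '2.0.0' into a 3-tuple of ints."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def predates_fix(version, fixed_in=FIXED_IN):
    """Return True if the given version is older than the assumed fix release."""
    return version_tuple(version) < fixed_in


def installed_pyarrow_version():
    """Return the installed pyarrow version string, or None if it is not installed."""
    try:
        return metadata.version("pyarrow")
    except metadata.PackageNotFoundError:
        return None


current = installed_pyarrow_version()
if current is None:
    print("pyarrow is not installed")
elif predates_fix(current):
    print(f"pyarrow {current} predates the assumed fix; consider upgrading")
else:
    print(f"pyarrow {current} is at or above the assumed fix release")
```

The parser deliberately ignores pre-release suffixes and keeps only the first three numeric components, which is enough for a coarse check like this one.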
