How to save a pandas dataframe when a column contains sets

Posted 2025-01-18 12:00:08


When trying to save a pandas dataframe where a column contains sets (see the example below):

import pandas as pd

df = pd.DataFrame({"col_set": [{"A", "B", "C"}, {"D", "E", "F"}]})
df.to_parquet("df_w_col_set.parquet")

the following error is thrown:

ArrowInvalid: ("Could not convert {'C', 'B', 'A'} with type set: did not recognize Python value type when inferring an Arrow data type", 'Conversion failed for column col_set with type object')

How can one save this kind of dataframe and avoid the error above?

Some semi-related posts mention providing a pyarrow schema, but I'm not clear on which of the pyarrow data types to use.

Code was run with Python 3.7.4, pandas==1.3.0 and pyarrow==3.0.0.

Mainly looking for a solution where upgrades are not needed or are really minimized (to avoid breaking other dependencies).


Comments (1)

地狱即天堂 2025-01-25 12:00:08


As a workaround, you can convert your sets to strings and use ast.literal_eval to evaluate the strings back into sets:

import ast

df.astype({'col_set': str}).to_parquet('data.parquet')
df1 = pd.read_parquet('data.parquet') \
        .assign(col_set=lambda x: x['col_set'].map(ast.literal_eval))
print(df1)

# Output
     col_set
0  {C, B, A}
1  {F, E, D}

Or you can convert your sets to tuple (or list) and then revert them to set:

df.assign(col_set=df['col_set'].map(tuple)).to_parquet('test.parquet')
df1 = pd.read_parquet('test.parquet') \
        .assign(col_set=lambda x: x['col_set'].map(set))
print(df1)

# Output
     col_set
0  {C, B, A}
1  {F, E, D}

You can also use pickle.dumps and pickle.loads to serialize your sets:

import pickle

df.assign(col_set=df['col_set'].map(pickle.dumps)).to_parquet('test.parquet')
df1 = pd.read_parquet('test.parquet') \
        .assign(col_set=lambda x: x['col_set'].map(pickle.loads))
print(df1)

# Output
     col_set
0  {C, B, A}
1  {F, E, D}

In fact, you can choose any serialization/deserialization method (except JSON, since JSON has no set type).
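To illustrate that last caveat (a quick sketch, not from the original answer): json.dumps rejects a set outright, so a JSON round trip only works if you convert to a list first and back to a set on read.

```python
import json

s = {"A", "B", "C"}

# json.dumps(s) would raise:
#   TypeError: Object of type set is not JSON serializable
encoded = json.dumps(sorted(s))          # store as a list instead
restored = set(json.loads(encoded))      # convert back on read
```

So JSON is usable too, but only with the same list conversion the other workarounds apply.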
