How to save a pandas DataFrame when a column contains sets
When trying to save a pandas DataFrame in which one column contains sets (see the example below):
import pandas as pd
df = pd.DataFrame({"col_set": [{"A", "B", "C"}, {"D", "E", "F"}]})
df.to_parquet("df_w_col_set.parquet")
The following error is thrown:
ArrowInvalid: ("Could not convert {'C', 'B', 'A'} with type set: did not recognize Python value type when inferring an Arrow data type", 'Conversion failed for column col_set with type object')
How can one save this kind of dataframe and avoid the error above?
Some semi-related posts mention providing a pyarrow schema, but I'm not clear on which type to use after consulting the pyarrow data types.
The code was run with python 3.7.4, pandas==1.3.0 and pyarrow==3.0.0.
I'm mainly looking for a solution where upgrades are not needed, or are kept to a minimum (to avoid breaking other dependencies).
Comments (1)
As a workaround, you can convert each set to a string before saving and use ast.literal_eval to evaluate the string back into a set after loading.
Alternatively, you can convert each set to a tuple (or list) before saving and convert it back to a set after loading.
You can also use pickle.dumps and pickle.loads to serialize the sets before saving and restore them after loading.
In fact, you can choose any (de)serialization method, except plain JSON, because JSON has no set type.