Inferring the correct schema for a partitioned Parquet dataset in PyArrow
I have a large partitioned Parquet dataset that I need to access either by loading all of it or by filtering it. I cannot load the entire dataset because of this error: ArrowNotImplementedError: Unsupported cast from string to null using function cast_null. From this issue, it turns out to be a problem linked to the first file being used to infer the overall schema. If I generate a unifiedSchema and use it to load the entire dataframe I have no problem at all, but when I apply a filter while passing the schema, the following error pops up: ArrowInvalid: No match for FieldRef.Name(MARKET_TAG) in PRODUCT_TAG: string
df = pd.read_parquet("data/reduced.parquet", filters=[("MARKET_TAG", "=", 3)], schema=unifiedSchema)
Is there a way to solve this?
I was thinking of solving the problem by storing the correct schema directly with pq.write_to_dataset, but I usually work on several datasets, each with a single "MARKET_TAG", and store the Parquet files at the end using partition_cols in the same folder. In that case, the schema of one table could still be "wrong" compared with the others, which would not solve my issue.