Inferring the correct schema for a partitioned Parquet dataset in PyArrow

Posted on 2025-02-13 15:30:00

I have a large partitioned Parquet dataset that I need to access either by loading all of it or by filtering it. I cannot load the entire dataset because of this error: ArrowNotImplementedError: Unsupported cast from string to null using function cast_null. According to this issue, the problem is linked to the first file being used to infer the overall schema. If I generate a unifiedSchema and use it to load the entire dataframe, I have no problem at all, but when I apply a filter while passing the schema, the following error pops up:
ArrowInvalid: No match for FieldRef.Name(MARKET_TAG) in PRODUCT_TAG: string

df = pd.read_parquet("data/reduced.parquet", filters=[("MARKET_TAG", "=", 3)], schema=unifiedSchema)

Is there a way to solve this?

I was thinking of solving the problem by storing the correct schema directly with pq.write_to_dataset, but I usually work on several datasets, each with a single "MARKET_TAG", and store the Parquet files at the end in the same folder using partition_cols. In that case, one table's schema could still be "wrong" compared with the others, which would not solve my issue.
