Chaining filters with PyArrow

Posted on 2025-01-23 16:57:49

I am trying to search a table in pyarrow using multiple parameters.
It looks like filters can be chained, but I am missing the magical incantation to make it actually work.

Table is loaded from CSV, so the structure works — I can filter using a single condition and the results are as expected.

Chaining the filters:

table.filter(
   compute.equal(table['a'], a_val)
).filter(
   compute.equal(table['b'], b_val)
).filter(
   compute.equal(table['c'], c_val)
)

Results in an error:

pyarrow.lib.ArrowInvalid: Filter inputs must all be the same length
    

I suspect the issue is that the second filter is on the original table and not the filtered output of the first filter.

Comments (3)

泛滥成性 2025-01-30 16:57:49

You can combine 2 filters together with and_:

import pyarrow as pa
import pyarrow.compute as compute


table = pa.Table.from_arrays(
    [
        pa.array([1,2,2], pa.int32()),
        pa.array(["foo","bar","hello"], pa.string())
    ],
    ['a', 'b']
)


compute.filter(
    table,
    compute.and_(
        compute.equal(table['a'], 2),
        compute.equal(table['b'], 'hello'),
    )
)
樱娆 2025-01-30 16:57:49

I believe your suspicion is correct.

The first call to table.filter gives an output table that is smaller than the original table, but the expression in your second filter call is still computed from the original table, so its mask is now too long for the filtered table.
To fix this, it should be enough to simply save the table back to a variable after the first call.
For instance like this:

table = table.filter(
   compute.equal(table['a'], a_val)
)
table.filter(
   compute.equal(table['b'], b_val)
)
最初的梦 2025-01-30 16:57:49

Another way to chain multiple filters is like this. The benefit is that you can have optional filters based on conditions.

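import pyarrow as pa
import pyarrow.compute as compute

# df is a pyarrow Table here, despite the name
filter_a = compute.is_in(df["col1"], pa.array(['a','b']))
filter_b = compute.is_in(df["col2"], pa.array(['a','b']))

masks = [filter_a, filter_b]

if True: # Optional Condition
    masks.append(compute.is_in(df["col3"], pa.array(['a','b'])))

# Keep a row only if every mask is true for it; as_py() turns the
# Arrow scalars into plain Python bools before all() inspects them
mask = [all(v.as_py() for v in row) for row in zip(*masks)]
items = df.filter(mask).to_pylist()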