如何使用 pandasql 在 pandas DataFrames 上执行复杂的条件外连接

发布于 2025-01-21 01:48:54 字数 738 浏览 2 评论 0原文

这个问题花了我很长时间来解决,我想分享我的解决方案。这是问题。

我们有2个PANDAS数据框,需要在非常复杂的条件下外部连接。这是我的:

condition_statement = """
    ON (
        A.var0 = B.var0
        OR (
            A.var1 = B.var1
            AND (
                A.var2 = B.var2
                OR A.var3 = B.var3
                OR A.var4 = B.var4
                OR A.var5 = B.var5
                OR A.var6 = B.var6
                OR A.var7 = B.var7
                OR (
                    A.var8 = B.var8
                    AND A.var9 = B.var9
                )
            )
        )
    )
"""

在大熊猫中这样做将是一场噩梦。

我喜欢使用Pandasql软件包进行大多数数据框架按摩。它允许您在本地环境中的数据范围顶部运行SQL查询。

Pandasql的问题在于它在SQLite引擎上运行,因此您不能做正确或完整的外部连接。

那么您如何解决这个问题?

This is a problem that took me a long time to solve, and I wanted to share my solution. Here's the problem.

We have 2 pandas DataFrames that need to be outer joined on a very complex condition. Here was mine:

condition_statement = """
    ON (
        A.var0 = B.var0
        OR (
            A.var1 = B.var1
            AND (
                A.var2 = B.var2
                OR A.var3 = B.var3
                OR A.var4 = B.var4
                OR A.var5 = B.var5
                OR A.var6 = B.var6
                OR A.var7 = B.var7
                OR (
                    A.var8 = B.var8
                    AND A.var9 = B.var9
                )
            )
        )
    )
"""

Doing this in pandas would be a nightmare.

I like to do most of my DataFrame massaging with the pandasql package. It lets you run SQL queries on top of the DataFrames in your local environment.

The problem with pandasql is it runs on a SQLite engine, so you can't do RIGHT or FULL OUTER joins.

So how do you approach this problem?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

北渚 2025-01-28 01:48:54

那么,您可以通过两个 LEFT 连接、一个条件和一个 UNION 来实现 FULL OUTER 连接。

首先,声明一个包含要检索的列的代码片段:

select_statement = """
    SELECT 
        A.var0
        , B.var1
        , COALESCE(A.var2, B.var2) as var2
"""

接下来,构建一个表示 A 中所有值为 NULL 的条件。我使用 DataFrame 中的列构建了我的:

where_a_is_null_statement = f"""
    WHERE 
    {" AND ".join(["A." + col + " is NULL" for col in A.columns])}
"""

现在,使用所有这些片段执行 2-LEFT-JOIN-with-UNION 技巧:

sqldf(f"""
    {select_statement}
    FROM A
    LEFT JOIN B
    {condition_statement}
    UNION
    {select_statement}
    FROM B
    LEFT JOIN A
    {condition_statement}
    {where_a_is_null_statement}
""")

Well you can achieve a FULL OUTER join with two LEFT joins, a condition, and a UNION.

First, declare a snippet with the columns you want to retrieve:

select_statement = """
    SELECT 
        A.var0
        , B.var1
        , COALESCE(A.var2, B.var2) as var2
"""

Next, build a condition that represents all values in A being NULL. I built mine using the columns in my DataFrame:

where_a_is_null_statement = f"""
    WHERE 
    {" AND ".join(["A." + col + " is NULL" for col in A.columns])}
"""

Now, do the 2-LEFT-JOIN-with-a-UNION trick using all of these snippets:

sqldf(f"""
    {select_statement}
    FROM A
    LEFT JOIN B
    {condition_statement}
    UNION
    {select_statement}
    FROM B
    LEFT JOIN A
    {condition_statement}
    {where_a_is_null_statement}
""")
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文