
COUNT with multiple LEFT JOINs

Posted on 2025-01-12 05:42:28

I am having some trouble with a COUNT function. The problem is caused by a LEFT JOIN that I am not sure I am doing correctly.

Variables are:

Customer_name (buyer)
Product_code (what the customer buys)
Store (where the customer buys)

The datasets are:

Customer_df (list of customers and product codes of their purchases)
Store1_df (list of product codes per week, for Store 1)
Store2_df (list of product codes per day, for Store 2)

Final output desired:
I would like to have a table with:

col1: Customer_name;
col2: Count of items purchased in store 1;
col3: Count of items purchased in store 2;
Filters: date range

My query looks like this:

SELECT
DISTINCT
C.customer_name,
C.product_code,
COUNT(S1.product_code) AS s1_sales,
COUNT(S2.product_code) AS s2_sales

FROM customer_df C
LEFT JOIN store1_df S1 USING(product_code)
LEFT JOIN store2_df S2 USING(product_code)

GROUP BY 
customer_name, product_code

HAVING
S1_sales > 0
OR S2_sales > 0

The output I expect is something like this:

Customer_name	Product_code	Store1_weekly_sales	Store2_weekly_sales
Luigi	120012	4	8
James	100022	6	10

But instead, I get:

Customer_name	Product_code	Store1_weekly_sales	Store2_weekly_sales
Luigi	120012	290	60
James	100022	290	60

It works when I use COUNT(DISTINCT product_code) instead of COUNT(product_code), but I would like to avoid that because I want to be able to aggregate over different timespans (e.g. if I count distinct values over more than 1 week of data, I will not get the right numbers).

My hypotheses are:

I am joining the tables in the wrong way
There is a problem when joining two datasets with different time aggregations

What am I doing wrong?


所谓喜欢 2025-01-19 05:42:28

The cause, as Philipxy indicated, is a common one. The joins produce a Cartesian product for each key, which inflates your counts. To simplify, let's consider a single customer purchasing one item from two stores: the first store has 3 purchase rows and the second store has 5. The joined result has 3 * 5 = 15 rows, because every store 1 row is paired with every store 2 row that shares the same key. The first purchase is joined to store 2 rows 1 through 5, the second purchase is joined to the same five rows, and so on, so both counts come out as 15. If each store is aggregated in its own sub-query first, there is AT MOST one row per customer per store (and per product, as in your desired outcome), so the joins can no longer multiply rows.
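To make the 3 * 5 inflation concrete, here is a minimal sketch using Python's built-in sqlite3 module. The schema is an assumption made for illustration only: a `customerid` column on every table (as in the query that follows) and a single hypothetical customer with 3 rows in store 1 and 5 rows in store 2.

```python
import sqlite3

# Hypothetical toy data: one customer, one product,
# 3 purchase rows in store 1 and 5 purchase rows in store 2.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_df (customerid INTEGER, customer_name TEXT);
    CREATE TABLE store1_df (customerid INTEGER, product_code INTEGER);
    CREATE TABLE store2_df (customerid INTEGER, product_code INTEGER);
    INSERT INTO customer_df VALUES (1, 'Luigi');
""")
conn.executemany("INSERT INTO store1_df VALUES (?, ?)", [(1, 120012)] * 3)
conn.executemany("INSERT INTO store2_df VALUES (?, ?)", [(1, 120012)] * 5)

# Joining both store tables directly pairs every store 1 row
# with every store 2 row for the same key, so both counts see 15 rows.
naive_counts = conn.execute("""
    SELECT c.customer_name,
           COUNT(s1.product_code) AS s1_sales,
           COUNT(s2.product_code) AS s2_sales
    FROM customer_df c
    LEFT JOIN store1_df s1 ON s1.customerid = c.customerid
    LEFT JOIN store2_df s2 ON s2.customerid = c.customerid
                          AND s2.product_code = s1.product_code
    GROUP BY c.customer_name
""").fetchall()

# Counting each store on its own gives the numbers that were expected.
separate_counts = (
    conn.execute("SELECT COUNT(*) FROM store1_df WHERE customerid = 1").fetchone()[0],
    conn.execute("SELECT COUNT(*) FROM store2_df WHERE customerid = 1").fetchone()[0],
)

print(naive_counts)     # [('Luigi', 15, 15)]
print(separate_counts)  # (3, 5)
```

The query that follows applies the same idea to the tables from the question: count each store separately, then join the already-aggregated rows.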

select
      c.customer_name,
      AllCustProducts.Product_Code,
      coalesce( PQStore1.SalesEntries, 0 ) Store1SalesEntries,
      coalesce( PQStore2.SalesEntries, 0 ) Store2SalesEntries
   from
      customer_df c
         -- now, we need all possible UNIQUE instances of 
         -- a given customer and product to prevent duplicates
         -- for subsequent queries of sales per customer and store
         JOIN
         ( select distinct customerid, product_code
              from store1_df 
           union
           select distinct customerid, product_code
              from store2_df ) AllCustProducts
            on c.customerid = AllCustProducts.customerid
            -- NOW, we can join to a pre-query of sales at store 1
            -- by customer id and product code.  You may also want to
            -- get sum( SalesDollars ) if available, just add respectively
            -- to each sub-query below.
            LEFT JOIN
            ( select
                    s1.customerid,
                    s1.product_code,
                    count(*) as SalesEntries
                 from
                    store1_df s1
                 group by
                    s1.customerid,
                    s1.product_code ) PQStore1
              on AllCustProducts.customerid = PQStore1.customerid
              AND AllCustProducts.product_code = PQStore1.product_code
            -- now, same pre-aggregation to store 2
            LEFT JOIN
            ( select
                    s2.customerid,
                    s2.product_code,
                    count(*) as SalesEntries
                 from
                    store2_df s2
                 group by
                    s2.customerid,
                    s2.product_code ) PQStore2
              on AllCustProducts.customerid = PQStore2.customerid
              AND AllCustProducts.product_code = PQStore2.product_code

No outer GROUP BY or HAVING is needed, since each pre-aggregate returns at most one row per unique customer and product combination. As for filtering by date range, I would add a WHERE clause inside each of the AllCustProducts, PQStore1, and PQStore2 sub-queries.
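As a rough sketch of that date-range suggestion, the example below runs the pre-aggregated query against a toy SQLite database through Python's sqlite3 module. Several things here are assumptions rather than facts from the question: the `sale_date` column, the `customerid` columns, one row per customer in `customer_df`, the shortened aliases (`acp`, `pq1`, `pq2`), the helper name `sales_by_store`, and all sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_df (customerid INTEGER, customer_name TEXT);
    CREATE TABLE store1_df (customerid INTEGER, product_code INTEGER, sale_date TEXT);
    CREATE TABLE store2_df (customerid INTEGER, product_code INTEGER, sale_date TEXT);
    INSERT INTO customer_df VALUES (1, 'Luigi'), (2, 'James');
    INSERT INTO store1_df VALUES
        (1, 120012, '2025-01-01'), (1, 120012, '2025-01-02'),
        (1, 120012, '2025-01-20'), (2, 100022, '2025-01-03');
    INSERT INTO store2_df VALUES
        (1, 120012, '2025-01-02'), (1, 120012, '2025-01-05'),
        (2, 100022, '2025-01-25');
""")

# The same date filter is repeated inside every sub-query, so each
# pre-aggregate only counts rows that fall in the requested range.
QUERY = """
    SELECT c.customer_name,
           acp.product_code,
           COALESCE(pq1.sales_entries, 0) AS store1_sales,
           COALESCE(pq2.sales_entries, 0) AS store2_sales
    FROM customer_df c
    JOIN (SELECT customerid, product_code FROM store1_df
           WHERE sale_date BETWEEN :start AND :end
          UNION
          SELECT customerid, product_code FROM store2_df
           WHERE sale_date BETWEEN :start AND :end) acp
      ON acp.customerid = c.customerid
    LEFT JOIN (SELECT customerid, product_code, COUNT(*) AS sales_entries
                 FROM store1_df
                WHERE sale_date BETWEEN :start AND :end
                GROUP BY customerid, product_code) pq1
      ON pq1.customerid = acp.customerid
     AND pq1.product_code = acp.product_code
    LEFT JOIN (SELECT customerid, product_code, COUNT(*) AS sales_entries
                 FROM store2_df
                WHERE sale_date BETWEEN :start AND :end
                GROUP BY customerid, product_code) pq2
      ON pq2.customerid = acp.customerid
     AND pq2.product_code = acp.product_code
    ORDER BY c.customer_name
"""

def sales_by_store(connection, start, end):
    """Return (customer, product, store 1 count, store 2 count) rows for a date range."""
    return connection.execute(QUERY, {"start": start, "end": end}).fetchall()

first_week = sales_by_store(conn, "2025-01-01", "2025-01-07")
print(first_week)  # [('James', 100022, 1, 0), ('Luigi', 120012, 2, 2)]
```

Dates are stored as ISO 8601 text here, so BETWEEN compares them correctly as strings. On an engine with a native DATE type the filter would be written against that type instead.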
