COUNT with multiple LEFT joins
I am having some trouble with a COUNT function. The problem is caused by a left join that I am not sure I am doing correctly.
Variables are:
- Customer_name (buyer)
- Product_code (what the customer buys)
- Store (where the customer buys)
The datasets are:
- Customer_df (list of customers and product codes of their purchases)
- Store1_df (list of product codes per week, for Store 1)
- Store2_df (list of product codes per day, for Store 2)
Final output desired:
I would like to have a table with:
- col1: Customer_name;
- col2: Count of items purchased in store 1;
- col3: Count of items purchased in store 2;
- Filters: date range
My query looks like this:
SELECT
  DISTINCT
  C.customer_name,
  C.product_code,
  COUNT(S1.product_code) AS s1_sales,
  COUNT(S2.product_code) AS s2_sales
FROM customer_df C
LEFT JOIN store1_df S1 USING(product_code)
LEFT JOIN store2_df S2 USING(product_code)
GROUP BY
  customer_name, product_code
HAVING
  s1_sales > 0
  OR s2_sales > 0
The output I expect is something like this:
Customer_name | Product_code | Store1_weekly_sales | Store2_weekly_sales |
---|---|---|---|
Luigi | 120012 | 4 | 8 |
James | 100022 | 6 | 10 |
But instead, I get:
Customer_name | Product_code | Store1_weekly_sales | Store2_weekly_sales |
---|---|---|---|
Luigi | 120012 | 290 | 60 |
James | 100022 | 290 | 60 |
It works when I use COUNT(DISTINCT product_code) instead of COUNT(product_code), but I would like to avoid that because I want to be able to aggregate over different timespans (e.g. if I count distinct values and take more than 1 week of data into account, I will not get the right numbers).
My hypotheses are:
- I am joining the tables in the wrong way
- There is a problem when joining two datasets with different time aggregations
What am I doing wrong?
Comments (1)
The cause, as Philipxy indicated, is a common one. You are getting a Cartesian result from your data, which bloats your numbers. To simplify, let's consider just a single customer purchasing one item from two stores. The first store has 3 purchases and the second store has 5 purchases. Your total count is 3 * 5 = 15, because each entry in the first store is joined to every entry in the second store with the same customer id. So the 1st purchase is joined to second-store rows 1-5, then the 2nd purchase is joined to second-store rows 1-5, and so on, and you can see the bloat. If instead each store is pre-queried to aggregate its own counts, there will be AT MOST one record per customer per store (and per product, as per your desired outcome).
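To make the 3 * 5 arithmetic concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table layouts and the `sale_date` column are assumptions for illustration (the question does not show the real schemas), and the joins are written with `ON` rather than `USING`, which should be equivalent here.

```python
import sqlite3

# In-memory database with hypothetical schemas; sale_date is an assumed column.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_df (customer_name TEXT, product_code INTEGER);
    CREATE TABLE store1_df (product_code INTEGER, sale_date TEXT);
    CREATE TABLE store2_df (product_code INTEGER, sale_date TEXT);
    INSERT INTO customer_df VALUES ('Luigi', 120012);
""")
# One product: 3 rows in store 1 and 5 rows in store 2.
conn.executemany("INSERT INTO store1_df VALUES (120012, ?)",
                 [(f"2024-01-0{d}",) for d in range(1, 4)])
conn.executemany("INSERT INTO store2_df VALUES (120012, ?)",
                 [(f"2024-01-0{d}",) for d in range(1, 6)])

# Joining both store tables directly pairs every store-1 row with every
# store-2 row for the same product, so both counts see 3 * 5 = 15 rows.
naive = conn.execute("""
    SELECT C.customer_name, C.product_code,
           COUNT(S1.product_code) AS s1_sales,
           COUNT(S2.product_code) AS s2_sales
    FROM customer_df C
    LEFT JOIN store1_df S1 ON S1.product_code = C.product_code
    LEFT JOIN store2_df S2 ON S2.product_code = C.product_code
    GROUP BY C.customer_name, C.product_code
""").fetchall()
print(naive)  # [('Luigi', 120012, 15, 15)]
```

Both columns report 15 rather than 3 and 5, which is the same kind of inflation as the 290 / 60 figures in the question.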
There is no need for an outer GROUP BY or HAVING, since each pre-aggregate yields at most 1 record per unique combination. As for filtering by date range, I would just add a WHERE clause within each of AllCustProducts, PQStore1, and PQStore2.
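The query that defines AllCustProducts, PQStore1, and PQStore2 is not shown on this page, so the following is only a sketch of what such a pre-aggregated query might look like, again run through Python's `sqlite3`. The `sale_date` column on all three tables, the sample rows, and the `:date_from` / `:date_to` parameter names are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schemas and sample data; sale_date is an assumed column.
conn.executescript("""
    CREATE TABLE customer_df (customer_name TEXT, product_code INTEGER, sale_date TEXT);
    CREATE TABLE store1_df (product_code INTEGER, sale_date TEXT);
    CREATE TABLE store2_df (product_code INTEGER, sale_date TEXT);
    INSERT INTO customer_df VALUES
        ('Luigi', 120012, '2024-01-02'), ('James', 100022, '2024-01-03');
    INSERT INTO store1_df VALUES
        (120012, '2024-01-01'), (120012, '2024-01-05'),
        (120012, '2024-02-01'), (100022, '2024-01-03');
    INSERT INTO store2_df VALUES
        (120012, '2024-01-02'), (120012, '2024-01-03'),
        (100022, '2024-01-04'), (100022, '2024-03-01');
""")

# Each store is aggregated on its own first, so every derived table has at
# most one row per product_code and the joins cannot multiply rows.
# The date filter is repeated inside each of the three subqueries.
QUERY = """
SELECT AllCustProducts.customer_name,
       AllCustProducts.product_code,
       COALESCE(PQStore1.s1_sales, 0) AS s1_sales,
       COALESCE(PQStore2.s2_sales, 0) AS s2_sales
FROM (SELECT DISTINCT customer_name, product_code
      FROM customer_df
      WHERE sale_date BETWEEN :date_from AND :date_to) AllCustProducts
LEFT JOIN (SELECT product_code, COUNT(*) AS s1_sales
           FROM store1_df
           WHERE sale_date BETWEEN :date_from AND :date_to
           GROUP BY product_code) PQStore1
       ON PQStore1.product_code = AllCustProducts.product_code
LEFT JOIN (SELECT product_code, COUNT(*) AS s2_sales
           FROM store2_df
           WHERE sale_date BETWEEN :date_from AND :date_to
           GROUP BY product_code) PQStore2
       ON PQStore2.product_code = AllCustProducts.product_code
ORDER BY AllCustProducts.customer_name
"""
rows = conn.execute(
    QUERY, {"date_from": "2024-01-01", "date_to": "2024-01-07"}
).fetchall()
print(rows)  # [('James', 100022, 1, 1), ('Luigi', 120012, 2, 2)]
```

Rows outside the requested week (the February and March entries) are excluded before counting. Because the store tables in the question carry no customer column, the counts here are per product; if the real tables do identify the customer, that column would presumably be added to each subquery's GROUP BY and to the join conditions.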