How to return rows with 0 count for missing data?

Posted on 2024-07-10 15:24:48

I have unevenly distributed data (with respect to date) for a few years (2003-2008). I want to query the data for a given start and end date, grouping it by any of the supported intervals (day, week, month, quarter, year) in PostgreSQL 8.3.

The problem is that some queries return results that are continuous over the required period, like this one:

select to_char(date_trunc('month',date), 'YYYY-MM-DD'), count(distinct post_id) 
from some_table
where category_id = 1
and entity_id = 77
and entity2_id = 115 
and date <= '2008-12-06'
and date >= '2007-12-01'
group by date_trunc('month',date)
order by date_trunc('month',date);

  to_char   | count 
------------+-------
 2007-12-01 |    64
 2008-01-01 |    31
 2008-02-01 |    14
 2008-03-01 |    21
 2008-04-01 |    28
 2008-05-01 |    44
 2008-06-01 |   100
 2008-07-01 |    72
 2008-08-01 |    91
 2008-09-01 |    92
 2008-10-01 |    79
 2008-11-01 |    65
(12 rows)

But others miss some intervals because no data is present for them, like this one:

select to_char(date_trunc('month',date), 'YYYY-MM-DD'), count(distinct post_id) 
from some_table
where category_id=1
and entity_id = 75
and entity2_id = 115 
and date <= '2008-12-06'
and date >= '2007-12-01'
group by date_trunc('month',date)
order by date_trunc('month',date);
    
  to_char   | count 
------------+-------
 2007-12-01 |     2
 2008-01-01 |     2
 2008-03-01 |     1
 2008-04-01 |     2
 2008-06-01 |     1
 2008-08-01 |     3
 2008-10-01 |     2
(7 rows)

The required result set is:

  to_char   | count 
------------+-------
 2007-12-01 |     2
 2008-01-01 |     2
 2008-02-01 |     0
 2008-03-01 |     1
 2008-04-01 |     2
 2008-05-01 |     0
 2008-06-01 |     1
 2008-07-01 |     0
 2008-08-01 |     3
 2008-09-01 |     0
 2008-10-01 |     2
 2008-11-01 |     0
(12 rows)

That is, a count of 0 for intervals with no entries.

I have seen earlier discussions on Stack Overflow, but they don't seem to solve my problem, since my grouping period is one of (day, week, month, quarter, year) and is decided at runtime by the application. So I guess an approach like a left join with a calendar table or sequence table will not help.

My current solution to this is to fill in these gaps in Python (in a Turbogears App) using the calendar module.

Is there a better way to do this?

记忆之渊 2024-07-17 15:24:48

This question is old. But since fellow users picked it as the master for a new duplicate, I am adding a proper answer.

Proper solution

SELECT *
FROM  (
   SELECT day::date
   FROM   generate_series(timestamp '2007-12-01'
                        , timestamp '2008-12-01'
                        , interval  '1 month') day
   ) d
LEFT   JOIN (
   SELECT date_trunc('month', date_col)::date AS day
        , count(*) AS some_count
   FROM   tbl
   WHERE  date_col >= date '2007-12-01'
   AND    date_col <= date '2008-12-06'
-- AND    ... more conditions
   GROUP  BY 1
   ) t USING (day)
ORDER  BY day;

Use LEFT JOIN, of course.

generate_series() can produce a table of timestamps on the fly, and very fast. See:

Generating time series between two dates in PostgreSQL

It's typically faster to aggregate before you join. Related answer with test case in a fiddle:

PostgreSQL - order by an array

Cast the timestamp to date (::date) for a basic format. For more formatting options, use to_char().

GROUP BY 1 is syntax shorthand to reference the first output column. Could be GROUP BY day as well, but that might conflict with an existing column of the same name. Or GROUP BY date_trunc('month', date_col)::date but that's too long for my taste.

This works with any of the units that date_trunc() accepts, as long as the step passed to generate_series() matches the unit.
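As a sketch of how the same query might look for a unit other than 'month', here is a variant for 'quarter'. It reuses the placeholder names tbl and date_col from the query above. The pairing of unit and step listed in the comment is an assumption about what the application would pass in at runtime, not something stated in the question. The start of the series is truncated to the unit so that the generated values line up with the output of date_trunc(). As far as I know, the timestamp form of generate_series() requires PostgreSQL 8.4 or later; on 8.3 only the integer form is available.

```sql
-- Assumed pairing of date_trunc() unit and series step, picked by the application:
--   'day' / '1 day', 'week' / '1 week', 'month' / '1 month',
--   'quarter' / '3 months', 'year' / '1 year'
SELECT *
FROM  (
   SELECT day::date
   FROM   generate_series(date_trunc('quarter', timestamp '2007-12-01')  -- align start to the unit
                        , timestamp '2008-12-06'
                        , interval  '3 months') day
   ) d
LEFT   JOIN (
   SELECT date_trunc('quarter', date_col)::date AS day
        , count(*) AS some_count
   FROM   tbl
   WHERE  date_col >= date '2007-12-01'
   AND    date_col <= date '2008-12-06'
   GROUP  BY 1
   ) t USING (day)
ORDER  BY day;
```

Only the unit string and the step interval change between groupings, so the application can substitute both values together.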

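count() never returns NULL (it returns 0 for no rows), but the LEFT JOIN does: some_count is NULL for every generated interval that has no matching row on the right side.
To return 0 instead of NULL in the outer SELECT, use COALESCE(some_count, 0) AS some_count. See the manual.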
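Put together, the monthly query with COALESCE() in the outer SELECT could look like the following sketch. It uses the same placeholder names tbl and date_col as the query above and changes only the outer select list.

```sql
SELECT d.day
     , COALESCE(t.some_count, 0) AS some_count  -- 0 for months without rows
FROM  (
   SELECT day::date
   FROM   generate_series(timestamp '2007-12-01'
                        , timestamp '2008-12-01'
                        , interval  '1 month') day
   ) d
LEFT   JOIN (
   SELECT date_trunc('month', date_col)::date AS day
        , count(*) AS some_count
   FROM   tbl
   WHERE  date_col >= date '2007-12-01'
   AND    date_col <= date '2008-12-06'
   GROUP  BY 1
   ) t USING (day)
ORDER  BY d.day;
```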

For a more generic solution or arbitrary time intervals see:

Best way to count rows by arbitrary time intervals

怀里藏娇 2024-07-17 15:24:48

You can create a list of the first day of every month of the past year (say) with:

select distinct date_trunc('month', (current_date - offs)) as date 
from generate_series(0,365,28) as offs;
          date
------------------------
 2007-12-01 00:00:00+01
 2008-01-01 00:00:00+01
 2008-02-01 00:00:00+01
 2008-03-01 00:00:00+01
 2008-04-01 00:00:00+02
 2008-05-01 00:00:00+02
 2008-06-01 00:00:00+02
 2008-07-01 00:00:00+02
 2008-08-01 00:00:00+02
 2008-09-01 00:00:00+02
 2008-10-01 00:00:00+02
 2008-11-01 00:00:00+01
 2008-12-01 00:00:00+01

Then you can join with that series.
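A minimal sketch of such a join, assuming the table and columns from the question (some_table, date, post_id and the three id filters). The alias month_start and the output name post_count are made up for illustration. The filter conditions sit in the ON clause rather than in WHERE, so that months without matching rows are kept, and count() over the unmatched (NULL) post_id values yields 0.

```sql
SELECT m.month_start
     , count(DISTINCT s.post_id) AS post_count   -- 0 when no row matched
FROM  (
   SELECT DISTINCT date_trunc('month', current_date - offs)::date AS month_start
   FROM   generate_series(0, 365, 28) AS offs
   ) m
LEFT   JOIN some_table s
       ON  date_trunc('month', s.date)::date = m.month_start
       AND s.category_id = 1
       AND s.entity_id   = 75
       AND s.entity2_id  = 115
GROUP  BY m.month_start
ORDER  BY m.month_start;
```

This form uses only the integer variant of generate_series(), which should also be available on PostgreSQL 8.3.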
