Query to select recurring dates using a tally table
I have a table for events, with a field that specifies how frequently the event occurs (in days). I'd like to select all occurrences of the event within a given date range, including the calculated occurrences (e.g. if the first event date is Jan 6 2011 and it occurs every 7 days, you'd see Jan 13 and Jan 20 in the results).
Here's what my events table looks like:
event_ID INT,
event_title NVARCHAR(50),
first_event_date DATETIME,
occurs_every INT
After reading this article, it seems like the most efficient way to handle this is with a tally table, but I haven't been able to wrap my head around how to return the results I'm looking for.
Let's say I have data that looks like this:
event_ID | event_title | first_event_date | occurs_every
1        | Event 1     | 1/6/2011         | 7
2        | Event 2     | 1/8/2011         | 3
The results I'm looking for would be:
event_ID | event_title | event_date
1        | Event 1     | 1/6/2011
2        | Event 2     | 1/8/2011
1        | Event 1     | 1/13/2011
2        | Event 2     | 1/12/2011
2        | Event 2     | 1/16/2011
1        | Event 1     | 1/20/2011
(etc)
Any suggestions? Edit: I'm using SQL Server 2008.
Additional info:
I've got a working query, but it seems pretty kludgy and I'm concerned about the performance once I get more data into the table.
First, for reference, this is the Tally table:
SELECT TOP 11000
IDENTITY(INT,1,1) AS N
INTO dbo.Tally
FROM Master.dbo.SysColumns sc1,
Master.dbo.SysColumns sc2
ALTER TABLE dbo.Tally
ADD CONSTRAINT PK_Tally_N
PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100
Now, here's the kludgy select query:
SELECT event_ID,
event_title,
first_event_date,
DATEADD(dd, occurs_every * ( t.N - 1 ), [first_event_date]) AS occurrence
FROM dbo.Events
CROSS JOIN dbo.Tally t
WHERE DATEADD(dd, occurs_every * ( t.N - 1 ), [first_event_date]) <= '03-01-2011'
ORDER BY occurrence
Now, this works - but when I added 1000 rows of sample data to the table it really bogged down. I assume that's my cross join.
Comments (3)
In SQL Server 2008 you can use a recursive CTE.
You have to remember to filter by the date range so it doesn't get into an infinite loop, or use the MAXRECURSION hint to limit the results (by default this value is 100).
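For illustration, here's a minimal sketch of that approach against the table layout from the question (this is not code from the original answer); the @EndDate variable, its value, and the ORDER BY are assumptions.

-- Assumed end of the requested date range
DECLARE @EndDate DATETIME = '2011-03-01';

WITH occurrences AS
(
    -- Anchor: the first occurrence of each event
    SELECT event_ID, event_title, first_event_date AS event_date, occurs_every
      FROM dbo.Events
    UNION ALL
    -- Recursive step: add occurs_every days until the date filter stops it
    SELECT event_ID, event_title, DATEADD(dd, occurs_every, event_date), occurs_every
      FROM occurrences
     WHERE DATEADD(dd, occurs_every, event_date) <= @EndDate
)
SELECT event_ID, event_title, event_date
  FROM occurrences
 ORDER BY event_date
OPTION (MAXRECURSION 1000);  -- raise or lower the cap as needed; the default is 100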
First, please accept my most sincere apologies for not getting back to this post. I made a couple of comments as a preamble and with full intent to post a useful answer later instead of just “sage advice” and then real life happened and I totally lost track of this post.
Let’s first revisit the OP’s post by building the table he said he was using and populating it with a thousand events like he said he did. I’ll modernize the data a bit by using random start dates for 2015 and 2016 using a high performance “pseudo cursor” to provide the “presence of rows” that we need instead of the RBAR of either a While Loop or rCTE (Recursive CTE).
As a bit of a side bar, I’m keeping everything 2005 compatible because there are still a whole lot of people using 2005 and there’s no performance gain in using 2008+ techniques for this.
Here’s the code to build the test table. Details are in the comments.
Here's what the first 10 rows will look like, with the understanding that the values for first_event_date and occurs_every will be quite different because of the methods I used to generate constrained random data.
Just to be sure, you're going to need a Tally Table to duplicate the OP's experiment. Here's the code for that. If you already have one, make sure that it has the required unique clustered index (usually in the form of a PK) for performance reasons. I have modernized the row-source tables in the "pseudo-cursor" part of the code to NOT use the deprecated "syscolumns" view.
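A sketch of such a Tally Table build, assuming sys.all_columns as the modernized row source (the TOP 11000 and FILLFACTOR come from the script in the question):

SELECT TOP 11000
       IDENTITY(INT, 1, 1) AS N
  INTO dbo.Tally
  FROM master.sys.all_columns ac1
 CROSS JOIN master.sys.all_columns ac2;

ALTER TABLE dbo.Tally
  ADD CONSTRAINT PK_Tally_N
      PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100;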
We're ready to rock. A part of the OP's code got swallowed by the forum but I was able to recover it using an edit of his original post. It actually looks like this except that I changed the "end date" to match the data that I just generated (and that's the only change I made). Since the code contains no scalar or multi-statement UDFs, I also turned on statistics to try to explain what's going on.
Here's the OP's code with the changes mentioned.
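The recovered code itself is not shown in this copy, but based on the query quoted in the question it presumably looked something like this; the end-date value and the exact statistics switches are assumptions.

SET STATISTICS TIME, IO ON;

SELECT event_ID,
       event_title,
       first_event_date,
       DATEADD(dd, occurs_every * ( t.N - 1 ), first_event_date) AS occurrence
  FROM #Events
 CROSS JOIN dbo.Tally t
 WHERE DATEADD(dd, occurs_every * ( t.N - 1 ), first_event_date) <= '2016-12-31'  -- assumed "end date"
 ORDER BY occurrence;

SET STATISTICS TIME, IO OFF;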
Here are the stats from running the OP's code. Sorry about all the scrolling but they're long lines.
Obviously, that performance is making sucking sounds that even a While Loop or rCTE could beat. What IS the problem?
If you check out the highlighted arrow in the execution plan below, you find that it contains 11 MILLION actual rows because of the non-SARGable (SARG = "Search ARGument" and non-SARGable means it can't use an index properly) criteria that caused a full CROSS JOIN between the 11,000 row Tally Table and the 1,000 row #Events table. And those are ACTUAL rows, not ESTIMATED rows, folks.
The reason is that the "N" column of the Tally Table is used in a formula and, as a result, the entire Tally Table must be scanned for every row in the #Events table. This is a common error that makes people think that Tally Tables produce slow code.
So, how do we fix it? Rather than using t.N to calculate a date for each row, let's take the difference between the dates and divide by the number of days to figure out the number of occurrences that t.N needs to equate to, and see what happens. Note that the only thing I changed in the code below was the criteria in the WHERE clause, to make the lookup on t.N SARGable (able to use an index to start and stop a seek followed by a range scan).
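The exact changed line is not preserved in this copy; a WHERE clause along these lines matches the description, with the end-date value again an assumption.

SELECT event_ID,
       event_title,
       first_event_date,
       DATEADD(dd, occurs_every * ( t.N - 1 ), first_event_date) AS occurrence
  FROM #Events
 CROSS JOIN dbo.Tally t
 WHERE t.N <= DATEDIFF(dd, first_event_date, '2016-12-31') / occurs_every + 1  -- SARGable: a bounded range of N per event
 ORDER BY occurrence;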
Here's what the new execution plan looks like. 61,766 actual rows (all in cache) is a whole lot different than 11 MILLION rows.
Here's what the stats on that little slice of computational heaven look like.
Total amount of code changed... 1 line of the WHERE clause.
We could lower the total number of reads to just 7 by using one of Itzik Ben-Gan's inline cascading CTEs (is NOT an rCTE), as well.
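For reference, here's a sketch of that style of inline, cascading (non-recursive) CTE used as the row source; the nesting depth, the TOP value, and wiring it straight into the events query are assumptions, not Itzik Ben-Gan's exact code.

WITH E1(N)    AS (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
                  UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1), -- 10 rows
     E2(N)    AS (SELECT 1 FROM E1 a CROSS JOIN E1 b),   -- 100 rows
     E4(N)    AS (SELECT 1 FROM E2 a CROSS JOIN E2 b),   -- 10,000 rows
     Tally(N) AS (SELECT TOP 11000 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
                    FROM E4 a CROSS JOIN E2 b)           -- inline replacement for the physical dbo.Tally table
SELECT e.event_ID,
       e.event_title,
       DATEADD(dd, e.occurs_every * ( t.N - 1 ), e.first_event_date) AS occurrence
  FROM #Events e
 CROSS JOIN Tally t
 WHERE t.N <= DATEDIFF(dd, e.first_event_date, '2016-12-31') / e.occurs_every + 1
 ORDER BY occurrence;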
The bottom line is that while the use of a Tally Table is nearly a panacea for performance, you do have to use it correctly, just like anything else. You have to use "Best Practices", such as writing a SARGable WHERE clause to get it to use the index correctly, just like anything else.
Again, my most sincere apologies, especially to the OP, for being so late with this. I hope it will help someone in the future. I also apologize for not having the time to rewrite the rCTE example on this thread to show how bad it can be. If you're interested in why rCTEs are so bad and you don't mind SQLServerCentral.com membership, then here's an article on the subject. I'd post all that here but it's too long to do so.
Hidden RBAR: Counting with Recursive CTE's
Here is one method using Oracle (you can switch this to other engines by modifying the sub-query that generates consecutive numbers, see below). The idea behind this query is to generate a consecutive list of multipliers (e.g. 0, 1, 2, 3..., n) up to the window size (days between dates). This is what the sub-query returns. We use this to cross join with the event table and then limit the results to the requested date range.
The tally_table in the query is the table you specified in your question.
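The query itself is not preserved in this copy; a sketch of that shape of query, assuming a CONNECT BY LEVEL sub-query for the consecutive multipliers, the column names from the question, and a roughly 60-day window, might look like this (table and alias names are assumptions).

SELECT e.event_ID,
       e.event_title,
       e.first_event_date + (m.multiplier * e.occurs_every) AS event_date
  FROM events e
 CROSS JOIN (SELECT LEVEL - 1 AS multiplier          -- 0, 1, 2, ..., n
               FROM dual
            CONNECT BY LEVEL <= 60) m                -- 60 = assumed window size in days
 WHERE e.first_event_date + (m.multiplier * e.occurs_every)
       BETWEEN DATE '2011-01-06' AND DATE '2011-03-01'
 ORDER BY event_date;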