SQL to determine minimum sequential days of access?
The following User History table contains one record for every day a given user has accessed a website (in a 24 hour UTC period). It has many thousands of records, but only one record per day per user. If the user has not accessed the website for that day, no record will be generated.
Id      UserId  CreationDate
------  ------  -----------------------
750997  12      2009-07-07 18:42:20.723
750998  15      2009-07-07 18:42:20.927
751000  19      2009-07-07 18:42:22.283
What I'm looking for is a SQL query on this table with good performance, that tells me which userids have accessed the website for (n) continuous days without missing a day.
In other words, how many users have (n) records in this table with sequential (day-before, or day-after) dates? If any day is missing from the sequence, the sequence is broken and should restart again at 1; we're looking for users who have achieved a continuous number of days here with no gaps.
Any resemblance between this query and a particular Stack Overflow badge is purely coincidental, of course. :)
How about (and please make sure the previous statement ended with a semi-colon):
The idea is that if we have a list of the days (as numbers) and a row_number, then missed days make the offset between these two lists slightly bigger. So we're looking for a range that has a consistent offset.
You could use "ORDER BY NumConsecutiveDays DESC" at the end of this, or say "HAVING count(*) > 14" for a threshold...
I haven't tested this though - just writing it off the top of my head. Hopefully works in SQL2005 and on.
...and would be very much helped by an index on tablename(UserID, CreationDate)
Edited: Turns out Offset is a reserved word, so I used TheOffset instead.
Edited: The suggestion to use COUNT(*) is very valid - I should've done that in the first place but wasn't really thinking. Previously it was using datediff(day, min(CreationDate), max(CreationDate)) instead.
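The offset idea described in this answer can be sketched as follows. This is a minimal illustration, not necessarily the query the answer originally showed: it runs on SQLite through Python's `sqlite3` module as a stand-in for SQL Server, so `julianday` takes the place of `DATEDIFF`. The table name `UserHistory`, the sample rows, and the helper `streaks` are assumptions.

```python
import sqlite3

# Assumed sample data: user 12 visits on the 7th, 8th, 9th and 11th; user 15 on the 7th and 9th.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE UserHistory (Id INTEGER PRIMARY KEY, UserId INTEGER, CreationDate TEXT)")
conn.executemany(
    "INSERT INTO UserHistory (UserId, CreationDate) VALUES (?, ?)",
    [(12, "2009-07-07 18:42:20.723"), (12, "2009-07-08 09:00:00.000"),
     (12, "2009-07-09 23:10:00.000"), (12, "2009-07-11 08:00:00.000"),
     (15, "2009-07-07 18:42:20.927"), (15, "2009-07-09 07:30:00.000")],
)

# Window functions need SQLite 3.25 or newer.
STREAKS_SQL = """
WITH numbered AS (
    SELECT UserId,
           date(CreationDate) AS Day,
           -- day number minus row number stays constant within a gap-free run
           CAST(julianday(date(CreationDate)) AS INTEGER)
             - ROW_NUMBER() OVER (PARTITION BY UserId ORDER BY CreationDate) AS TheOffset
    FROM UserHistory
)
SELECT UserId, MIN(Day), MAX(Day), COUNT(*) AS NumConsecutiveDays
FROM numbered
GROUP BY UserId, TheOffset
HAVING COUNT(*) >= ?
ORDER BY UserId, MIN(Day)
"""

def streaks(conn, n):
    """Return (UserId, first day, last day, length) for every run of at least n days."""
    return conn.execute(STREAKS_SQL, (n,)).fetchall()

print(streaks(conn, 3))
```

Grouping by `(UserId, TheOffset)` yields one row per run, so the `HAVING` threshold filters runs rather than users; a user with two qualifying runs appears twice.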
The answer is obviously:
Okay here's my serious answer:
[Jeff Atwood] This is a great fast solution and deserves to be accepted, but Rob Farley's solution is also excellent and arguably even faster (!). Please check it out too!
If you can change the table schema, I'd suggest adding a column LongestStreak to the table, which you'd set to the number of sequential days ending at the CreationDate. It's easy to update the table at login time (similar to what you are doing already: if no row exists for the current day, you check whether a row exists for the previous day. If it does, you increment the LongestStreak in the new row; otherwise, you set it to 1). The query will be obvious after adding this column:
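A rough sketch of this bookkeeping, assuming the schema can be changed as suggested. It uses SQLite via Python's `sqlite3` in place of SQL Server; the helper names `record_visit` and `users_with_streak`, and storing `CreationDate` as a date-only string, are assumptions made for illustration.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE UserHistory (UserId INTEGER, CreationDate TEXT, LongestStreak INTEGER, "
    "PRIMARY KEY (UserId, CreationDate))"
)

def record_visit(conn, user_id, day):
    """Insert the row for `day` if it is missing, carrying the streak over from the day before."""
    today = day.isoformat()
    exists = conn.execute(
        "SELECT 1 FROM UserHistory WHERE UserId = ? AND CreationDate = ?", (user_id, today)
    ).fetchone()
    if exists:
        return  # at most one row per user per day
    yesterday = (day - timedelta(days=1)).isoformat()
    prev = conn.execute(
        "SELECT LongestStreak FROM UserHistory WHERE UserId = ? AND CreationDate = ?",
        (user_id, yesterday),
    ).fetchone()
    streak = prev[0] + 1 if prev else 1  # a missed day restarts the streak at 1
    conn.execute("INSERT INTO UserHistory VALUES (?, ?, ?)", (user_id, today, streak))

def users_with_streak(conn, n):
    """Users who have ever reached n consecutive days."""
    sql = "SELECT DISTINCT UserId FROM UserHistory WHERE LongestStreak >= ? ORDER BY UserId"
    return [row[0] for row in conn.execute(sql, (n,))]

for d in (7, 8, 9, 11):
    record_visit(conn, 12, date(2009, 7, d))
for d in (7, 9):
    record_visit(conn, 15, date(2009, 7, d))
print(users_with_streak(conn, 3))
```

The trade-off is a slightly more expensive write at login time in exchange for a lookup that needs no window functions or self-joins.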
Some nicely expressive SQL along the lines of:
Assuming you have a user defined aggregate function something along the lines of (beware this is buggy):
Seems like you could take advantage of the fact that to be continuous over n days would require there to be n rows.
So something like:
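One way to express this idea, as a sketch and not necessarily the query this answer originally contained: count each user's rows inside an n-day window and keep the users with exactly n rows. It assumes SQLite via Python's `sqlite3`, a table named `UserHistory`, and a hypothetical helper `visited_every_day`; it also leans on the question's guarantee of at most one row per user per day.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE UserHistory (Id INTEGER PRIMARY KEY, UserId INTEGER, CreationDate TEXT)")
conn.executemany(
    "INSERT INTO UserHistory (UserId, CreationDate) VALUES (?, ?)",
    [(12, "2009-07-07 18:42:20.723"), (12, "2009-07-08 09:00:00.000"),
     (12, "2009-07-09 23:10:00.000"), (12, "2009-07-11 08:00:00.000"),
     (15, "2009-07-07 18:42:20.927"), (15, "2009-07-09 07:30:00.000")],
)

def visited_every_day(conn, last_day, n):
    """Users with a row on each of the n days ending on last_day (an ISO date string)."""
    first_day = (date.fromisoformat(last_day) - timedelta(days=n - 1)).isoformat()
    sql = """
        SELECT UserId
        FROM UserHistory
        WHERE date(CreationDate) BETWEEN ? AND ?
        GROUP BY UserId
        HAVING COUNT(*) = ?   -- n rows in an n-day window means no day was missed
        ORDER BY UserId
    """
    return [row[0] for row in conn.execute(sql, (first_day, last_day, n))]

print(visited_every_day(conn, "2009-07-09", 3))
```

Note that this only detects streaks ending on the chosen day; finding streaks anywhere in the history needs one of the other approaches.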
Doing this with a single SQL query seems overly complicated to me. Let me break this answer down in two parts.
Run a daily cron job that checks for every user whether he has logged in today and then increments a counter if he has or sets it to 0 if he hasn't.
- Export this table to a server that doesn't run your website and won't be needed for a while. ;)
- Sort it by user, then date.
- Go through it sequentially, keeping a counter...
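The offline steps above (sort by user, then date, and walk the rows with a counter) could look roughly like this in plain Python. The function name `longest_streaks` and the `(user_id, date)` record shape are assumptions for illustration.

```python
from datetime import date

def longest_streaks(records):
    """records: iterable of (user_id, date) pairs.
    Returns {user_id: length of that user's longest run of consecutive days}."""
    best = {}
    prev_user, prev_day, run = None, None, 0
    # Sorting tuples orders by user first, then by date; set() drops exact duplicates.
    for user, day in sorted(set(records)):
        if user == prev_user and (day - prev_day).days == 1:
            run += 1          # same user, next calendar day: extend the run
        else:
            run = 1           # new user or a gap: restart at 1
        best[user] = max(best.get(user, 0), run)
        prev_user, prev_day = user, day
    return best

records = [
    (12, date(2009, 7, 7)), (12, date(2009, 7, 8)), (12, date(2009, 7, 9)),
    (12, date(2009, 7, 11)), (15, date(2009, 7, 7)), (15, date(2009, 7, 9)),
]
result = longest_streaks(records)
print(result)
```

Users qualifying for an n-day threshold are then simply those whose value is at least n.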
If this is so important to you, source this event and drive a table to give you this info. No need to kill the machine with all those crazy queries.
You could use a recursive CTE (SQL Server 2005+):
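A hedged sketch of what a recursive CTE for this might look like. It is written for SQLite (`WITH RECURSIVE`, run through Python's `sqlite3`) rather than SQL Server, and the table name `UserHistory`, the CTE names, and the helper `users_with_run` are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE UserHistory (Id INTEGER PRIMARY KEY, UserId INTEGER, CreationDate TEXT)")
conn.executemany(
    "INSERT INTO UserHistory (UserId, CreationDate) VALUES (?, ?)",
    [(12, "2009-07-07 18:42:20.723"), (12, "2009-07-08 09:00:00.000"),
     (12, "2009-07-09 23:10:00.000"), (12, "2009-07-11 08:00:00.000"),
     (15, "2009-07-07 18:42:20.927"), (15, "2009-07-09 07:30:00.000")],
)

RUNS_SQL = """
WITH RECURSIVE
days AS (
    SELECT DISTINCT UserId, date(CreationDate) AS Day FROM UserHistory
),
runs(UserId, Day, Len) AS (
    SELECT UserId, Day, 1 FROM days            -- every visit starts a run of length 1
    UNION ALL
    SELECT d.UserId, d.Day, r.Len + 1          -- extend a run by the following day
    FROM runs AS r
    JOIN days AS d
      ON d.UserId = r.UserId AND d.Day = date(r.Day, '+1 day')
    WHERE r.Len < :n                           -- stop once the target length is reached
)
SELECT DISTINCT UserId FROM runs WHERE Len = :n ORDER BY UserId
"""

def users_with_run(conn, n):
    return [row[0] for row in conn.execute(RUNS_SQL, {"n": n})]

print(users_with_run(conn, 3))
```

Each visit can be extended up to n - 1 times, so the work grows with n times the row count; an index on the user and date columns matters for the join.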
Joe Celko has a complete chapter on this in SQL for Smarties (calling it Runs and Sequences). I don't have that book at home, so when I get to work... I'll actually answer this. (assuming history table is called dbo.UserHistory and the number of days is @Days)
Another lead is from SQL Team's blog on runs
The other idea I've had, but don't have a SQL server handy to work on here is to use a CTE with a partitioned ROW_NUMBER like this:
The above is likely WAY HARDER than it has to be, but I've left it as a brain tickle for when you have some other definition of "a run" than just dates.
A couple of SQL Server 2012 options (assuming N=100 below).
Though with my sample data the following turned out to be more efficient:
Both rely on the constraint stated in the question that there is at most one record per day per user.
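SQL Server 2012 introduced `LAG`, and one plausible shape for such a query is sketched below; whether it matches the options this answer originally listed is not known. With at most one row per user per day, the row n - 1 positions earlier belongs to the same streak exactly when it is n - 1 days older. The sketch uses SQLite's `LAG` via Python's `sqlite3`; the table name and the helper `users_with_run_lag` are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE UserHistory (Id INTEGER PRIMARY KEY, UserId INTEGER, CreationDate TEXT)")
conn.executemany(
    "INSERT INTO UserHistory (UserId, CreationDate) VALUES (?, ?)",
    [(12, "2009-07-07 18:42:20.723"), (12, "2009-07-08 09:00:00.000"),
     (12, "2009-07-09 23:10:00.000"), (12, "2009-07-11 08:00:00.000"),
     (15, "2009-07-07 18:42:20.927"), (15, "2009-07-09 07:30:00.000")],
)

def users_with_run_lag(conn, n):
    k = int(n) - 1  # how many rows (and days) to look back; inlined as a literal below
    sql = f"""
        SELECT DISTINCT UserId
        FROM (
            SELECT UserId,
                   CAST(julianday(date(CreationDate))
                        - julianday(LAG(date(CreationDate), {k})
                              OVER (PARTITION BY UserId ORDER BY CreationDate))
                        AS INTEGER) AS Span
            FROM UserHistory
        )
        WHERE Span = {k}   -- k rows back is exactly k days back: no gap in between
        ORDER BY UserId
    """
    return [row[0] for row in conn.execute(sql)]

print(users_with_run_lag(conn, 3))
```

Rows without enough history get a NULL from `LAG`, so they drop out of the `WHERE` filter without special handling.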
Something like this?
I used a simple mathematical property to identify who accessed the site on consecutive days: the number of days spanned between a user's first and last access should equal the number of records that user has in the access log table.
Here is the SQL script that I tested in Oracle DB (it should work in other databases as well):
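A small sketch of that property, assuming SQLite through Python's `sqlite3` instead of Oracle and a table named `UserHistory`. The `+ 1` reflects one reading of the property (counting the first and last day inclusively) and is an assumption, as are the sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE UserHistory (Id INTEGER PRIMARY KEY, UserId INTEGER, CreationDate TEXT)")
conn.executemany(
    "INSERT INTO UserHistory (UserId, CreationDate) VALUES (?, ?)",
    [(12, "2009-07-07 18:42:20.723"), (12, "2009-07-08 09:00:00.000"),
     (12, "2009-07-09 23:10:00.000"),
     (15, "2009-07-07 18:42:20.927"), (15, "2009-07-09 07:30:00.000"),
     (19, "2009-07-07 18:42:22.283")],
)

GAP_FREE_SQL = """
SELECT UserId, COUNT(*) AS Days
FROM UserHistory
GROUP BY UserId
-- inclusive span from first to last visit equals the row count only when no day is missing
HAVING julianday(MAX(date(CreationDate))) - julianday(MIN(date(CreationDate))) + 1 = COUNT(*)
ORDER BY UserId
"""

gap_free = conn.execute(GAP_FREE_SQL).fetchall()
print(gap_free)
```

Applied to the whole table this only finds users whose entire history is gap-free, so in practice it would be combined with a `WHERE` clause restricting the rows to the date window of interest.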
Table prep script:
The statement
cast(convert(char(11), @startdate, 113) as datetime)
removes the time part of the date so we start at midnight. I would also assume that the relevant columns are indexed.
I just realized that this won't tell you all the users and their total consecutive days. It will, however, tell you which users visited on each of a set number of days starting from a date of your choosing.
Revised solution:
I've checked this and it will query for all users and all dates. It is based on Spencer's 1st (joke?) solution, but mine works.
Update: improved the date handling in the second solution.
This should do what you want but I don't have enough data to test efficiency. The convoluted CONVERT/FLOOR stuff is to strip the time portion off the datetime field. If you're using SQL Server 2008 then you could use CAST(x.CreationDate AS DATE).
Creation script
Spencer almost did it, but this should be the working code:
Off the top of my head, MySQLish:
Untested, and almost certainly needs some conversion for MSSQL, but I think it gives some ideas.
How about one using tally tables? It follows a more algorithmic approach, and the execution plan is a breeze. Populate the tallyTable with numbers from 1 to 'MaxDaysBehind', the number of days back that you want to scan the table (i.e. 90 will look 3 months back, etc.).
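A tally-table query along these lines might be sketched as below. This is an illustration under assumptions, not the answer's original code: it uses SQLite via Python's `sqlite3`, the column name `N`, the table name `UserHistory`, and the helper `visited_last_n_days` are all made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE UserHistory (Id INTEGER PRIMARY KEY, UserId INTEGER, CreationDate TEXT)")
conn.executemany(
    "INSERT INTO UserHistory (UserId, CreationDate) VALUES (?, ?)",
    [(12, "2009-07-07 18:42:20.723"), (12, "2009-07-08 09:00:00.000"),
     (12, "2009-07-09 23:10:00.000"), (12, "2009-07-11 08:00:00.000"),
     (15, "2009-07-07 18:42:20.927"), (15, "2009-07-09 07:30:00.000")],
)

# Tally table holding 1..MaxDaysBehind (90 here, roughly three months back).
conn.execute("CREATE TABLE tallyTable (N INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO tallyTable (N) VALUES (?)", [(i,) for i in range(1, 91)])

TALLY_SQL = """
SELECT h.UserId
FROM tallyTable AS t
JOIN UserHistory AS h
  -- N = 1 maps to :today, N = 2 to the day before, and so on
  ON date(h.CreationDate) = date(julianday(:today) - (t.N - 1))
WHERE t.N <= :n
GROUP BY h.UserId
HAVING COUNT(*) = :n      -- a match for every one of the n expected days
ORDER BY h.UserId
"""

def visited_last_n_days(conn, today, n):
    return [row[0] for row in conn.execute(TALLY_SQL, {"today": today, "n": n})]

print(visited_last_n_days(conn, "2009-07-09", 3))
```

The tally rows enumerate the days that must be present, so a user qualifies only when the join finds a visit for each of them.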
Tweaking Bill's query a bit. You might have to truncate the date before grouping to count only one login per day...
EDITED to use DATEADD(dd, DATEDIFF(dd, 0, CreationDate), 0) instead of convert( char(10) , CreationDate, 101 ).
I was looking to use datepart earlier, but I was too lazy to look up the syntax, so I figured I'd use convert instead. I didn't know it had a significant impact. Thanks! Now I know.
Assuming a schema that goes like:
This will extract contiguous ranges from a date sequence with gaps.