MySQL GROUP BY DateTime +/- 3 seconds
Suppose I have a table with 3 columns:
- id (PK, int)
- timestamp (datetime)
- title (text)
I have the following records:
1, 2010-01-01 15:00:00, Some Title
2, 2010-01-01 15:00:02, Some Title
3, 2010-01-02 15:00:00, Some Title
I need to GROUP BY records that are within 3 seconds of each other. For this table, rows 1 and 2 would be grouped together.
There is a similar question here: Mysql DateTime group by 15 mins
I also found this: http://www.artfulsoftware.com/infotree/queries.php#106
I don't know how to convert these methods into something that will work for seconds. The trouble with the method on the SO question is that, as far as I can tell, it only works for records falling within a bin of time that starts at a known point. For instance, if I were to get FLOOR() to work with seconds, at an interval of 5 seconds, a time of 15:00:04 would be grouped with 15:00:01, but not with 15:00:06.
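To make the fixed-bin problem concrete, here is a small sketch. It assumes the bin index is simply floor(seconds / 5), with times written as seconds past 15:00:00; `fixed_bin` is a made-up helper name, not something from the linked question.

```python
# Fixed-width binning: every time falls into the bin floor(seconds / width).
def fixed_bin(seconds, width=5):
    return seconds // width

# Seconds past 15:00:00 for the three times mentioned above.
times = {"15:00:01": 1, "15:00:04": 4, "15:00:06": 6}
bins = {label: fixed_bin(s) for label, s in times.items()}

for label, b in bins.items():
    print(label, "-> bin", b)
```

15:00:04 and 15:00:06 are only 2 seconds apart, yet they land in different bins because the bin edges are fixed at multiples of 5.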
Does this make sense? Please let me know if further clarification is needed.
EDIT: For the set of numbers {1, 2, 3, 4, 5, 6, 7, 50, 51, 60}, it seems it might be best to group them as {1, 2, 3, 4, 5, 6, 7}, {50, 51}, {60}, so that each row's group depends on whether the row is within 3 seconds of the previous one. I know this changes things a bit; I'm sorry for being wishy-washy on this.
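The grouping rule from the EDIT can be written down outside SQL as a reference for what any query should produce. This is only a sketch; `chain_groups` is a hypothetical name, and "within 3 seconds" is assumed to be inclusive (a gap of exactly 3 still joins the chain).

```python
def chain_groups(values, gap=3):
    """Split values into chains where each value is within `gap` of the previous one."""
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= gap:
            groups[-1].append(v)   # close enough to the previous value: same chain
        else:
            groups.append([v])     # gap too large (or first value): start a new chain
    return groups

groups = chain_groups([1, 2, 3, 4, 5, 6, 7, 50, 51, 60])
print(groups)
```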
I am trying to fuzzy-match logs from different servers. Server #1 may log an item, "Item #1", and Server #2 will log that same item, "Item #1", within a few seconds of server #1. I need to do some aggregate functions on both log lines. Unfortunately, I only have title to go on, due to the nature of the server software.
I'm using Tom H.'s excellent idea but doing it a little differently here:
Instead of finding all the rows that are the beginnings of chains, we can find all times that are the beginnings of chains, then go back and find the rows that match the times.
Query #1 here should tell you which times are the beginnings of chains by finding which times do not have any times below them but within 3 seconds:
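The original query is not reproduced here, so the following is only a sketch of what such a chain-start query could look like. It runs against SQLite through Python's sqlite3 module so that it can be executed as-is. The table name `log`, the column name `ts`, and the use of plain integer seconds are all assumptions; in MySQL the datetime column would first have to be converted to seconds, for example with UNIX_TIMESTAMP().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (id INTEGER PRIMARY KEY, ts INTEGER, title TEXT)")
conn.executemany(
    "INSERT INTO log (ts, title) VALUES (?, 'Some Title')",
    [(t,) for t in [1, 2, 3, 4, 5, 6, 7, 50, 51, 60]],
)

# A time starts a chain if no earlier time lies within the 3 seconds before it.
chain_starts_sql = """
SELECT DISTINCT a.ts
FROM log AS a
WHERE NOT EXISTS (
    SELECT 1
    FROM log AS b
    WHERE b.ts < a.ts
      AND b.ts >= a.ts - 3
)
ORDER BY a.ts
"""
chain_starts = [row[0] for row in conn.execute(chain_starts_sql)]
print(chain_starts)
```

On the number set from the question's EDIT, the chain starts are 1, 50 and 60.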
And then for each row, we can find the largest chain-starting timestamp that is less than our timestamp with Query #2:
Once we have that, we can GROUP BY it as you wanted.
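Putting the two steps together might look like the sketch below. Again, this is not the answer's original query: the `log` table, the `ts` column, and integer seconds are assumptions, and SQLite stands in for MySQL so the example runs on its own. Note that the comparison is written as "less than or equal to" so that a chain start maps to itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (id INTEGER PRIMARY KEY, ts INTEGER, title TEXT)")
conn.executemany(
    "INSERT INTO log (ts, title) VALUES (?, 'Some Title')",
    [(t,) for t in [1, 2, 3, 4, 5, 6, 7, 50, 51, 60]],
)

# Inner query: tag each row with the largest chain-starting time <= its own time.
# Outer query: GROUP BY that tag and aggregate.
group_sql = """
SELECT chain_start, COUNT(*) AS n, MIN(ts) AS first_ts, MAX(ts) AS last_ts
FROM (
    SELECT l.ts,
           (SELECT MAX(s.ts)
            FROM log AS s
            WHERE s.ts <= l.ts
              AND NOT EXISTS (SELECT 1
                              FROM log AS p
                              WHERE p.ts < s.ts
                                AND p.ts >= s.ts - 3)
           ) AS chain_start
    FROM log AS l
) AS tagged
GROUP BY chain_start
ORDER BY chain_start
"""
groups = conn.execute(group_sql).fetchall()
for g in groups:
    print(g)
```

Each output row is (chain_start, row count, first time, last time); any other aggregate could replace COUNT, MIN and MAX.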
I'm not entirely sure this is distinct enough from Tom H's answer to be posted separately, but it sounded like you were having trouble with implementation, and I was thinking about it, so I thought I'd post again. Good luck!
Now that I think that I understand your problem, based on your comment response to OMG Ponies, I think that I have a set-based solution. The idea is to first find the start of any chains based on the title. The start of a chain is going to be defined as any row where there is no match within three seconds prior to that row:
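As a rough illustration of that definition (not the answer's original query), the sketch below finds chain starters per title. The table name `log`, the column `ts`, the sample rows, and the use of integer seconds are assumptions, and SQLite is used so the example is runnable. Rows with identical timestamps are ordered by id, so only the lowest id among them can be a starter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (id INTEGER PRIMARY KEY, ts INTEGER, title TEXT)")
rows = [
    (1, 0, "Item #1"),
    (2, 2, "Item #1"),      # within 3 seconds of row 1, same title
    (3, 1, "Item #2"),      # different title, so its own chain
    (4, 86400, "Item #1"),  # a day later
]
conn.executemany("INSERT INTO log (id, ts, title) VALUES (?, ?, ?)", rows)

# A row starts a chain if no row with the same title precedes it by at most 3 seconds.
# "Precedes" means an earlier ts, or the same ts with a lower id.
starters_sql = """
SELECT a.id, a.title, a.ts
FROM log AS a
WHERE NOT EXISTS (
    SELECT 1
    FROM log AS b
    WHERE b.title = a.title
      AND b.ts BETWEEN a.ts - 3 AND a.ts
      AND (b.ts < a.ts OR b.id < a.id)
)
ORDER BY a.id
"""
starters = conn.execute(starters_sql).fetchall()
print(starters)
```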
Now we can assume that any non-chain starters belong to the chain starter that appeared before them. Since MySQL doesn't support CTEs, you might want to throw the above results into a temporary table, as that would save you the multiple joins to the same subquery below.
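A sketch of the temporary-table variant, under the same assumptions as before (hypothetical `log` table, integer seconds, SQLite standing in for MySQL). The temporary table name `chain_starters` is made up for this example. Each row is assigned the most recent starter with the same title at or before it, with ties on the timestamp broken by id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (id INTEGER PRIMARY KEY, ts INTEGER, title TEXT)")
conn.executemany(
    "INSERT INTO log (id, ts, title) VALUES (?, ?, ?)",
    [(1, 0, "Item #1"), (2, 2, "Item #1"), (3, 1, "Item #2"),
     (4, 86400, "Item #1"), (5, 86400, "Item #1")],  # rows 4 and 5 tie exactly
)

# Step 1: store the chain starters once, so the lookup below does not repeat the subquery.
conn.execute("""
CREATE TEMPORARY TABLE chain_starters AS
SELECT a.id, a.title, a.ts
FROM log AS a
WHERE NOT EXISTS (
    SELECT 1 FROM log AS b
    WHERE b.title = a.title
      AND b.ts BETWEEN a.ts - 3 AND a.ts
      AND (b.ts < a.ts OR b.id < a.id)
)
""")

# Step 2: each row belongs to the latest starter of the same title at or before it.
assign_sql = """
SELECT l.id,
       (SELECT s.id
        FROM chain_starters AS s
        WHERE s.title = l.title
          AND (s.ts < l.ts OR (s.ts = l.ts AND s.id <= l.id))
        ORDER BY s.ts DESC, s.id DESC
        LIMIT 1) AS chain_id
FROM log AS l
ORDER BY l.id
"""
assignment = conn.execute(assign_sql).fetchall()
print(assignment)
```

The resulting `chain_id` can then serve as the GROUP BY key for whatever aggregates are needed.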
This would look much simpler if you could use CTEs or if you used a temporary table. Using the temporary table might also help performance.
Also, there will be issues with this if you can have timestamps that match exactly. If that's the case then you will need to tweak the query slightly to use a combination of the id and the timestamp to distinguish rows with matching timestamp values.
EDIT: Changed the queries to handle exact matches by timestamp.
Warning: Long answer. This should work, and is fairly neat, except for one step in the middle where you have to be willing to run an INSERT statement over and over until it doesn't do anything since we can't do recursive CTE things in MySQL.
I'm going to use this data as the example instead of yours:
Here is the first query to write:
It returns:
Let's create a nice table to hold those things that won't allow duplicates:
Now the challenge is to find something like the transitive closure of that relation.
To do so, let's find the next level of links. By that I mean, since we have 1 2 and 2 3 in the Adjacency table, we should add 1 3:
This is the non-elegant part: You'll need to run the above INSERT statement over and over until it doesn't add any rows to the table. I don't know if there is a neat way to do that.
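The repeat-until-nothing-changes step can be driven from client code. The sketch below is an assumption-heavy illustration, not the answer's original statements: the `events` table, its sample rows, the `adjacency` columns `a` and `b`, and the final MIN-based grouping are all made up, timestamps are assumed unique, and SQLite stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, ts INTEGER)")
conn.executemany("INSERT INTO events (id, ts) VALUES (?, ?)",
                 [(1, 0), (2, 2), (3, 4), (4, 20), (5, 21)])

# The primary key on (a, b) is what forbids duplicate links.
conn.execute("CREATE TABLE adjacency (a INTEGER, b INTEGER, PRIMARY KEY (a, b))")

# Seed: a links to b when b is at most 3 seconds after a (every row links to itself).
conn.execute("""
INSERT INTO adjacency (a, b)
SELECT x.id, y.id FROM events AS x JOIN events AS y
  ON y.ts >= x.ts AND y.ts - x.ts <= 3
""")

# One closure step: if a->b and b->c exist, add a->c unless it is already present.
extend_sql = """
INSERT INTO adjacency (a, b)
SELECT DISTINCT x.a, y.b
FROM adjacency AS x JOIN adjacency AS y ON x.b = y.a
WHERE NOT EXISTS (SELECT 1 FROM adjacency AS z WHERE z.a = x.a AND z.b = y.b)
"""

def link_count():
    return conn.execute("SELECT COUNT(*) FROM adjacency").fetchone()[0]

# Re-run the INSERT until a pass adds no rows.
while True:
    before = link_count()
    conn.execute(extend_sql)
    if link_count() == before:
        break

closure = set(conn.execute("SELECT a, b FROM adjacency"))
# One possible final step: the smallest id that reaches a row identifies its group.
group_of = dict(conn.execute("SELECT b, MIN(a) FROM adjacency GROUP BY b"))
print(sorted(closure))
print(group_of)
```

Here the first pass adds the link 1 3 and the second pass adds nothing, so the loop stops.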
Once this is over, you will have a transitively-closed relation like this:
And now for the punchline:
returns:
So
returns
Whew!
I like @Chris Cunningham's answer, but here's another take on it.
First, my understanding of your problem statement (correct me if I'm wrong):
I work mostly in SQL Server, so I'm using SQL Server syntax. It shouldn't be too difficult to translate into MySQL SQL.
So, first our event log table:
Given the above understanding of the problem statement, the following query should give you the upper and lower bounds of your groups. It's a simple, nested SELECT statement with 2 GROUP BY clauses to collapse things:
One SELECT defines the upper bound of each group. That upper boundary defines a group.
The other SELECT defines the lower bound of each group.
Every row in the table should fall into one of the groups so defined, and any given group may well consist of a single date/time value.
[edited: the upper bound is the lowest date/time value where the interval is more than 3 seconds]
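The original SQL Server query is missing from this copy, so the following is only one way the bounds could be computed, not a reconstruction of it. It is written against SQLite so it can be run directly; the table name `event_log`, the column `dt`, and integer seconds are assumptions, while `dtFrom` and `dtThru` follow the column names mentioned in the answer. Here an upper bound is taken to be a value with no later value within 3 seconds of it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event_log (id INTEGER PRIMARY KEY, dt INTEGER)")
conn.executemany("INSERT INTO event_log (dt) VALUES (?)",
                 [(t,) for t in [1, 2, 3, 4, 5, 6, 7, 50, 51, 60]])

# Inner query: for each row, the nearest value at or after it that ends a chain
# (nothing follows it within 3 seconds). That value is the group's upper bound.
# Outer query: collapse rows sharing an upper bound; the smallest dt is the lower bound.
bounds_sql = """
SELECT MIN(t.dt) AS dtFrom, t.dtThru
FROM (
    SELECT e.dt,
           (SELECT MIN(u.dt)
            FROM event_log AS u
            WHERE u.dt >= e.dt
              AND NOT EXISTS (SELECT 1
                              FROM event_log AS n
                              WHERE n.dt > u.dt
                                AND n.dt <= u.dt + 3)
           ) AS dtThru
    FROM event_log AS e
) AS t
GROUP BY t.dtThru
ORDER BY t.dtThru
"""
bounds = conn.execute(bounds_sql).fetchall()
print(bounds)
```

The last group shows the single-value case: its lower and upper bound are both 60.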
You could then pull rows from the event log and tag them with the group to which they belong thus:
Each row is tagged with its group via the dtFrom and dtThru columns returned. You could get fancy and assign an integral row number to each group if you want.
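Tagging rows with their group bounds could be done by joining the log back to the bounds. As with the other sketches, this is not the answer's original query: `event_log`, `dt`, the sample values, and integer seconds are assumptions, and SQLite is used so the example runs as written.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event_log (id INTEGER PRIMARY KEY, dt INTEGER)")
conn.executemany("INSERT INTO event_log (dt) VALUES (?)",
                 [(t,) for t in [1, 2, 3, 50, 51, 60]])

# The derived table g holds one (dtFrom, dtThru) pair per group;
# each log row joins to the single pair whose range contains its dt.
tag_sql = """
SELECT e.id, e.dt, g.dtFrom, g.dtThru
FROM event_log AS e
JOIN (
    SELECT MIN(t.dt) AS dtFrom, t.dtThru
    FROM (
        SELECT x.dt,
               (SELECT MIN(u.dt)
                FROM event_log AS u
                WHERE u.dt >= x.dt
                  AND NOT EXISTS (SELECT 1
                                  FROM event_log AS n
                                  WHERE n.dt > u.dt
                                    AND n.dt <= u.dt + 3)
               ) AS dtThru
        FROM event_log AS x
    ) AS t
    GROUP BY t.dtThru
) AS g ON e.dt BETWEEN g.dtFrom AND g.dtThru
ORDER BY e.id
"""
tagged = conn.execute(tag_sql).fetchall()
for row in tagged:
    print(row)
```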
Simple query: