MySQL GROUP BY DateTime +/- 3 seconds

Posted on 2024-11-18 08:48:30

Suppose I have a table with 3 columns:

  • id (PK, int)
  • timestamp (datetime)
  • title (text)

I have the following records:

1, 2010-01-01 15:00:00, Some Title
2, 2010-01-01 15:00:02, Some Title
3, 2010-01-02 15:00:00, Some Title

I need to GROUP BY records that are within 3 seconds of each other. For this table, rows 1 and 2 would be grouped together.

There is a similar question here: Mysql DateTime group by 15 mins

I also found this: http://www.artfulsoftware.com/infotree/queries.php#106

I don't know how to convert these methods into something that will work for seconds. The trouble with the method on the SO question is that it seems to me that it would only work for records falling within a bin of time that starts at a known point. For instance, if I were to get FLOOR() to work with seconds, at an interval of 5 seconds, a time of 15:00:04 would be grouped with 15:00:01, but not grouped with 15:00:06.
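To make the fixed-bin problem concrete, here is a minimal Python sketch (not MySQL) that mimics FLOOR(seconds / 5). The helper name `fixed_bin` and the seconds-past-15:00:00 encoding are assumptions made for illustration only.

```python
def fixed_bin(second_of_minute: int, width: int = 5) -> int:
    """Mimic FLOOR(seconds / width): the index of the bin on a fixed grid."""
    return second_of_minute // width

# Seconds past 15:00:00 for the three times mentioned above.
times = {"15:00:01": 1, "15:00:04": 4, "15:00:06": 6}
fixed_bins = {label: fixed_bin(sec) for label, sec in times.items()}
print(fixed_bins)
```

15:00:01 and 15:00:04 share a bin, while 15:00:04 and 15:00:06 do not, even though the latter pair is only 2 seconds apart.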

Does this make sense? Please let me know if further clarification is needed.

EDIT: For the set of numbers {1, 2, 3, 4, 5, 6, 7, 50, 51, 60}, it seems it might be best to group them as {1, 2, 3, 4, 5, 6, 7}, {50, 51}, {60}, so that each row's group depends on whether the row is within 3 seconds of the previous one. I know this changes things a bit; I'm sorry for being wishy-washy on this.
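The grouping rule described in the EDIT can be sketched outside SQL as a single pass over the sorted values. This is only an illustration in Python; the function name `group_by_gap` is hypothetical, and treating "within 3 seconds" as an inclusive gap of at most 3 is an assumption.

```python
def group_by_gap(values, max_gap=3):
    """Start a new group whenever the gap to the previous value exceeds max_gap."""
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= max_gap:
            groups[-1].append(v)   # close enough to the previous value: same chain
        else:
            groups.append([v])     # gap too large (or first value): new chain
    return groups

groups = group_by_gap([1, 2, 3, 4, 5, 6, 7, 50, 51, 60])
print(groups)
```

Note that a chain can span far more than 3 seconds in total, since only the gap between neighbours matters.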

I am trying to fuzzy-match logs from different servers. Server #1 may log an item, "Item #1", and Server #2 will log that same item, "Item #1", within a few seconds of server #1. I need to do some aggregate functions on both log lines. Unfortunately, I only have title to go on, due to the nature of the server software.


Comments (5)

尬尬 2024-11-25 08:48:30

I'm using Tom H.'s excellent idea but doing it a little differently here:

Instead of finding all the rows that are the beginnings of chains, we can find all times that are the beginnings of chains, then go back and find the rows that match the times.

Query #1 here should tell you which times are the beginnings of chains by finding which times do not have any times below them but within 3 seconds:

SELECT DISTINCT a.Timestamp
FROM Table a
LEFT JOIN Table b
ON (b.Timestamp >= a.Timestamp - INTERVAL 3 SECOND
    AND b.Timestamp < a.Timestamp)
WHERE b.Timestamp IS NULL

And then for each row, we can find the largest chain-starting timestamp that is less than or equal to our timestamp with Query #2:

SELECT Table.id, MAX(StartOfChains.Timestamp) AS ChainStartTime
FROM Table
JOIN ([query #1]) StartOfChains
ON Table.Timestamp >= StartOfChains.Timestamp
GROUP BY Table.id

Once we have that, we can GROUP BY it as you wanted.

SELECT COUNT(*) -- or whatever
FROM Table
JOIN ([query #2]) GroupingQuery
ON Table.id = GroupingQuery.id
GROUP BY GroupingQuery.ChainStartTime
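
As a way to sanity-check the three queries, the same logic can be sketched in Python. The names `chain_starts` and `assign_chain` are hypothetical, and timestamps are represented as plain seconds for simplicity; this is a sketch of the idea, not a replacement for the SQL.

```python
from bisect import bisect_right
from collections import Counter

def chain_starts(times, window=3):
    """Query #1: times with no strictly earlier time within `window` seconds."""
    distinct = sorted(set(times))
    return [t for i, t in enumerate(distinct)
            if i == 0 or t - distinct[i - 1] > window]

def assign_chain(rows, window=3):
    """Query #2: map each row id to the largest chain start <= its time."""
    starts = chain_starts([t for _, t in rows], window)
    return {rid: starts[bisect_right(starts, t) - 1] for rid, t in rows}

# (id, seconds since the first record) for the three rows in the question.
rows = [(1, 0), (2, 2), (3, 86400)]
chains = assign_chain(rows)
counts = Counter(chains.values())   # Query #3: COUNT(*) per ChainStartTime
print(chains, counts)
```

Rows 1 and 2 land in the chain that starts at 0, and row 3 starts its own chain a day later.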

I'm not entirely sure this is distinct enough from Tom H's answer to be posted separately, but it sounded like you were having trouble with implementation, and I was thinking about it, so I thought I'd post again. Good luck!

姜生凉生 2024-11-25 08:48:30

Now that I think I understand your problem, based on your comment in response to OMG Ponies, I think I have a set-based solution. The idea is to first find the start of every chain for each title. The start of a chain is defined as any row that has no matching row within the three seconds before it:

SELECT
    MT1.my_id,
    MT1.title,
    MT1.my_time
FROM
    My_Table MT1
LEFT OUTER JOIN My_Table MT2 ON
    MT2.title = MT1.title AND
    (
        MT2.my_time < MT1.my_time OR
        (MT2.my_time = MT1.my_time AND MT2.my_id < MT1.my_id)
    ) AND
    MT2.my_time >= MT1.my_time - INTERVAL 3 SECOND
WHERE
    MT2.my_id IS NULL

Now we can assume that any non-chain starters belong to the chain starter that appeared before them. Since MySQL doesn't support CTEs (they were only added in MySQL 8.0), you might want to throw the above results into a temporary table, as that would save you the multiple joins to the same subquery below.

SELECT
    SQ1.my_id,
    COUNT(*)  -- You didn't say what you were trying to calculate, just that you needed to group them
FROM
(
    SELECT
        MT1.my_id,
        MT1.title,
        MT1.my_time
    FROM
        My_Table MT1
    LEFT OUTER JOIN My_Table MT2 ON
        MT2.title = MT1.title AND
        (
            MT2.my_time < MT1.my_time OR
            (MT2.my_time = MT1.my_time AND MT2.my_id < MT1.my_id)
        ) AND
        MT2.my_time >= MT1.my_time - INTERVAL 3 SECOND
    WHERE
        MT2.my_id IS NULL
) SQ1
INNER JOIN My_Table MT3 ON
    MT3.title = SQ1.title AND
    MT3.my_time >= SQ1.my_time
LEFT OUTER JOIN
(
    SELECT
        MT1.my_id,
        MT1.title,
        MT1.my_time
    FROM
        My_Table MT1
    LEFT OUTER JOIN My_Table MT2 ON
        MT2.title = MT1.title AND
        (
            MT2.my_time < MT1.my_time OR
            (MT2.my_time = MT1.my_time AND MT2.my_id < MT1.my_id)
        ) AND
        MT2.my_time >= MT1.my_time - INTERVAL 3 SECOND
    WHERE
        MT2.my_id IS NULL
) SQ2 ON
    SQ2.title = SQ1.title AND
    SQ2.my_time > SQ1.my_time AND
    SQ2.my_time <= MT3.my_time
WHERE
    SQ2.my_id IS NULL
GROUP BY
    SQ1.my_id

This would look much simpler if you could use CTEs or if you used a temporary table. Using the temporary table might also help performance.

Also, there will be issues with this if you can have timestamps that match exactly. If that's the case then you will need to tweak the query slightly to use a combination of the id and the timestamp to distinguish rows with matching timestamp values.

EDIT: Changed the queries to handle exact matches by timestamp.
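
For comparison, here is a rough Python sketch of the same per-title chaining, including the id tie-break for identical timestamps. The function name `chains_by_title` and the sample rows are made up for illustration; times are plain seconds.

```python
from itertools import groupby

def chains_by_title(rows, window=3):
    """rows: iterable of (my_id, title, my_time). Returns {starter_id: [member ids]}."""
    chains = {}
    # Order by title, then time, then id, so equal timestamps are broken by id.
    ordered = sorted(rows, key=lambda r: (r[1], r[2], r[0]))
    for _, same_title in groupby(ordered, key=lambda r: r[1]):
        prev_time = None
        starter = None
        for my_id, _title, my_time in same_title:
            if prev_time is None or my_time - prev_time > window:
                starter = my_id            # nothing within the window before it
                chains[starter] = []
            chains[starter].append(my_id)
            prev_time = my_time
    return chains

sample = [(1, "Item #1", 0), (2, "Item #1", 2), (3, "Item #2", 2),
          (4, "Item #1", 2), (5, "Item #1", 10)]
print(chains_by_title(sample))
```

Each key is a chain starter's id, so `len()` of each value corresponds to the COUNT(*) per `SQ1.my_id` above.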

唐婉 2024-11-25 08:48:30

Warning: Long answer. This should work, and is fairly neat, except for one step in the middle where you have to be willing to run an INSERT statement over and over until it doesn't do anything, since we can't do recursive CTE things in MySQL (before version 8.0).

I'm going to use this data as the example instead of yours:

id    Timestamp
1     1:00:00
2     1:00:03
3     1:00:06
4     1:00:10

Here is the first query to write:

SELECT a.id AS aid, b.id AS bid
FROM Table a
JOIN Table b
ON ABS(TIMESTAMPDIFF(SECOND, a.Timestamp, b.Timestamp)) <= 3

It returns:

aid     bid
1       1
1       2
2       1
2       2
2       3
3       2
3       3
4       4

Let's create a nice table to hold those things that won't allow duplicates:

CREATE TABLE
Adjacency
( aid INT(11)
, bid INT(11)
, PRIMARY KEY (aid, bid) -- important for later
)

Now the challenge is to find something like the transitive closure of that relation.

To do so, let's find the next level of links. By that I mean, since we have 1 2 and 2 3 in the Adjacency table, we should add 1 3:

INSERT IGNORE INTO Adjacency(aid,bid)
SELECT adj1.aid, adj2.bid
FROM Adjacency adj1
JOIN Adjacency adj2
ON (adj1.bid = adj2.aid)

This is the non-elegant part: You'll need to run the above INSERT statement over and over until it doesn't add any rows to the table. I don't know if there is a neat way to do that.
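
The "repeat until nothing changes" loop is a fixed-point iteration. Here is a minimal Python sketch of it over the pairs from the example, with `transitive_closure` as a hypothetical helper name. In MySQL itself, one option might be to drive the same loop from application code or a stored procedure and stop when ROW_COUNT() reports no inserted rows, though I have not tested that here.

```python
def transitive_closure(pairs):
    """Keep adding (a, d) whenever (a, b) and (b, d) exist, until nothing new appears."""
    closure = set(pairs)
    while True:
        # Equivalent of the self-join: adj1.bid = adj2.aid
        new_pairs = {(a, d) for a, b in closure for c, d in closure if b == c}
        if new_pairs <= closure:      # the INSERT would add zero rows: done
            return closure
        closure |= new_pairs          # the INSERT IGNORE step

adjacency = {(1, 1), (1, 2), (2, 1), (2, 2), (2, 3), (3, 2), (3, 3), (4, 4)}
closed = transitive_closure(adjacency)
print(sorted(closed))
```

Each pass can roughly double the length of the links it discovers, so the number of passes tends to grow with the logarithm of the longest chain rather than with its length.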

Once this is over, you will have a transitively-closed relation like this:

aid     bid
1       1
1       2
1       3     --added
2       1
2       2
2       3
3       1     --added
3       2
3       3
4       4

And now for the punchline:

SELECT aid, GROUP_CONCAT( bid ORDER BY bid ) AS Neighbors
FROM Adjacency
GROUP BY aid

returns:

aid     Neighbors
1       1,2,3
2       1,2,3
3       1,2,3
4       4

So

SELECT DISTINCT Neighbors
FROM (
     SELECT aid, GROUP_CONCAT( bid ORDER BY bid ) AS Neighbors
     FROM Adjacency
     GROUP BY aid
     ) Groupings

returns

Neighbors
1,2,3
4

Whew!

爱人如己 2024-11-25 08:48:30

I like @Chris Cunningham's answer, but here's another take on it.

First, my understanding of your problem statement (correct me if I'm wrong):

You want to look at your event log as a sequence, ordered by the time of the event,
and partition it into groups, defining the boundary as being an interval of
more than 3 seconds between two adjacent rows in the sequence.

I work mostly in SQL Server, so I'm using SQL Server syntax. It shouldn't be too difficult to translate into MySQL SQL.

So, first our event log table:

--
-- our event log table
--
create table dbo.eventLog
(
  id       int          not null ,
  dtLogged datetime     not null ,
  title    varchar(200) not null ,

  primary key nonclustered ( id ) ,
  unique clustered ( dtLogged , id ) ,

)

Given the above understanding of the problem statement, the following query should give you the upper and lower bounds of your groups. It's a simple, nested select statement with 2 group by clauses to collapse things:

  • The innermost select defines the upper bound of each group. That upper boundary defines a group.
  • The outer select defines the lower bound of each group.

Every row in the table should fall into one of the groups so defined, and any given group may well consist of a single date/time value.

[edited: the upper bound is the lowest date/time value where the interval is more than 3 seconds]

select dtFrom = min( t.dtFrom ) ,
       dtThru =      t.dtThru
from ( select dtFrom = t1.dtLogged ,
              dtThru = min( t2.dtLogged )
       from      dbo.EventLog t1
       left join dbo.EventLog t2 on t2.dtLogged >= t1.dtLogged
                                and datediff(second,t1.dtLogged,t2.dtLogged) > 3
       group by t1.dtLogged
     ) t
group by t.dtThru

You could then pull rows from the event log and tag them with the group to which they belong thus:

select *
from ( select dtFrom = min( t.dtFrom ) ,
              dtThru =      t.dtThru
       from ( select dtFrom = t1.dtLogged ,
                     dtThru = min( t2.dtLogged )
              from      dbo.EventLog t1
              left join dbo.EventLog t2 on t2.dtLogged >= t1.dtLogged
                                       and datediff(second,t1.dtLogged,t2.dtLogged) > 3
              group by t1.dtLogged
            ) t
       group by t.dtThru
     ) period
join dbo.EventLog t on t.dtLogged >=           period.dtFrom
                   and t.dtLogged <= coalesce( period.dtThru , t.dtLogged )
order by period.dtFrom , period.dtThru , t.dtLogged

Each row is tagged with its group via the dtFrom and dtThru columns returned. You could get fancy and assign an integral row number to each group if you want.
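
As an illustration of assigning an integral group number, here is a small Python sketch that follows the problem statement as I worded it above (a gap of more than 3 seconds between adjacent rows starts a new group). The function name `tag_groups` is hypothetical and the sample timestamps are the ones from the question.

```python
from datetime import datetime, timedelta

def tag_groups(timestamps, max_gap=timedelta(seconds=3)):
    """Return (group_number, timestamp) pairs; a gap > max_gap starts a new group."""
    tagged = []
    group_no = 0
    prev = None
    for ts in sorted(timestamps):
        if prev is not None and ts - prev > max_gap:
            group_no += 1              # boundary between two adjacent rows
        tagged.append((group_no, ts))
        prev = ts
    return tagged

log = [datetime(2010, 1, 1, 15, 0, 0),
       datetime(2010, 1, 1, 15, 0, 2),
       datetime(2010, 1, 2, 15, 0, 0)]
tagged = tag_groups(log)
for group_no, ts in tagged:
    print(group_no, ts)
```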

兮颜 2024-11-25 08:48:30

Simple query:

SELECT * FROM time_history GROUP BY ROUND(UNIX_TIMESTAMP(time_stamp)/3);
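
One thing to keep in mind with this query is that it puts rows into fixed 3-second bins, so two rows close together can still fall on either side of a bin edge. A small Python sketch of the arithmetic, where `round_bin` is a hypothetical stand-in for ROUND(UNIX_TIMESTAMP(time_stamp)/3) on non-negative values:

```python
import math

def round_bin(unix_seconds, width=3):
    """Approximate ROUND(unix_seconds / width), rounding halves upward."""
    return math.floor(unix_seconds / width + 0.5)

# Three consecutive seconds, used as toy UNIX timestamps.
round_bins = {s: round_bin(s) for s in (3, 4, 5)}
print(round_bins)
```

Seconds 3 and 4 share a bin, but 4 and 5 do not, even though they are only 1 second apart.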