在 SQL 中执行类似数据透视表的 JOIN
我的雇主有一个批处理计算集群,用于处理用户提交的作业。每个批处理作业由三个步骤组成:
- 作业开始
- 作业完成
- 向用户报告的结果
批处理作业管理软件会记录每个步骤发生的时间,日志文件由一个元组组成,其中包含提交作业的员工的 ID 代码、什么发生的步骤以及发生的时间戳。在 CSV 中,它看起来像:
ID step timestamp
-- ------ ---------
A start 3533
B start 3538
B finish 3549
C start 3551
A finish 3557
B report 3559
C finish 3602
A report 3603
B start 3611
C report 3623
B finish 3643
B report 3657
等等。
该数据集的另一个特征是员工之间存在共识,但员工内部没有共识;即,每个员工必须等到当前的工作报告完毕后才能开始下一个工作。因此,当我按日期排序并将结果限制为单个员工时,记录总是按“开始”、“完成”、“报告”的顺序出现。
我想创建一个数据透视表,将每个作业分组为一行。所以上面的数据变成:
employee-ID started finished reported
----------- ------- -------- --------
A 3533 3557 3603
B 3538 3549 3559
B 3611 3643 3657
C 3551 3602 3623
那么,关于 SQL:
SELECT
log.ID AS employee-ID,
start.timestamp AS started,
finish.timestamp AS finished,
report.timestamp AS reported
FROM
log
LEFT OUTER JOIN log start ON
log.ID = start.ID
AND start.step = 'start'
LEFT OUTER JOIN log finish ON
log.ID = finish.ID
AND finish.step = 'finish'
AND start.timestamp < finish.timestamp
LEFT OUTER JOIN log report ON
log.ID = report.ID
AND report.step = 'report'
AND finish.timestamp < report.timestamp
ORDER BY employee-ID,started,finished,reported;
我确实需要 LEFT OUTER JOIN,因为我还需要识别已启动但未完成或报告的作业。
这效果非常好。它确实给了我我需要的行。但它给了我很多虚假行,因为除了当前工作之外,JOIN 还匹配同一员工未来工作的 finish
和 report
条目。因此,报告看起来像这样:
employee-ID started finished reported
----------- ------- -------- --------
A 3533 3557 3603
B 3538 3549 3559
B 3538 3549 3657 <-- spurious
B 3538 3643 3657 <-- spurious
B 3611 3643 3657
C 3551 3602 3623
很容易识别虚假行:每个作业仅启动一次,因此考虑到排序,正确的行是具有唯一“启动”值的第一行。我现在正在应用程序级别通过跳过虚假行来解决虚假行问题,但这似乎不太优雅。而且成本高昂:其中一些员工提交了数十份工作,因此目前我的查询结果大约有 15% 是合法条目,85% 是虚假条目。跳过虚假条目会浪费大量时间。如果每个作业都有一个唯一的 ID,那就太好了,但我只是没有该数据。
我需要以某种方式限制 JOIN,以便它为每个“开始”条目仅挑选一个“完成”和“报告”条目:最小时间戳大于前一步骤时间戳的单个条目。我尝试通过使用子查询作为我要加入的表来执行此操作,但我不知道如何在没有相关子查询的情况下执行此操作。我还尝试使用“GROUP BY employee-ID,started”来执行此操作,但这不一定会选择“正确”的行。 “GROUP BY”报告的大多数行都是错误的。
那么,SQL 专家,是否可以只报告我需要的行?如果是这样,怎么办?我现在正在使用 sqlite3,但如果需要的话可以将数据库传输到 MySQL。
My employer has a batch compute cluster that processes jobs submitted by users. Each batch job consists of three steps:
- job started
- job finished
- results reported to the user
The batch job management software logs when each of these steps happen, and the logfile consists of a tuple with an ID code of the employee who submitted the job, what step occurred, and a timestamp of when it occurred. In CSV, it looks like:
ID step timestamp
-- ------ ---------
A start 3533
B start 3538
B finish 3549
C start 3551
A finish 3557
B report 3559
C finish 3602
A report 3603
B start 3611
C report 3623
B finish 3643
B report 3657
And so forth.
One additional characteristic of the dataset is that there is concurrence between employees, but no concurrence within employees; ie, each employee has to wait until their current job has reported before their next job starts. So when I sort by date and limit the results to a single employee, the records always come out in the order "start","finish","report".
I want to create a pivot table that groups each job into a single row. So the above data becomes:
employee-ID started finished reported
----------- ------- -------- --------
A 3533 3557 3603
B 3538 3549 3559
B 3611 3643 3657
C 3551 3602 3623
So, on to the SQL:
SELECT
log.ID AS employee-ID,
start.timestamp AS started,
finish.timestamp AS finished,
report.timestamp AS reported
FROM
log
LEFT OUTER JOIN log start ON
log.ID = start.ID
AND start.step = 'start'
LEFT OUTER JOIN log finish ON
log.ID = finish.ID
AND finish.step = 'finish'
AND start.timestamp < finish.timestamp
LEFT OUTER JOIN log report ON
log.ID = report.ID
AND report.step = 'report'
AND finish.timestamp < report.timestamp
ORDER BY employee-ID,started,finished,reported;
I do need LEFT OUTER JOIN, because I also need to identify the jobs that were started but were not finished or reported.
This works pretty well. It does give me the rows that I need. But it gives me a lot of spurious rows, because the JOINs match finish
and report
entries for future jobs of the same employee in addition to the current job. So the report comes out looking like:
employee-ID started finished reported
----------- ------- -------- --------
A 3533 3557 3603
B 3538 3549 3559
B 3538 3549 3657 <-- spurious
B 3538 3643 3657 <-- spurious
B 3611 3643 3657
C 3551 3602 3623
It's easy to recognize the spurious rows: each job gets started only once, so given the sorting, the correct row is the first row with a unique "started" value. I'm working around the spurious rows problem right now at the application level by just skipping over the spurious rows, but that just seems, well, inelegant. And it's costly: some of these employees have dozens of jobs submitted, so currently, the results of my queries are about 15% legitimate entries and 85% spurious. That's a lot of wasted time skipping over bogus entries. It would be nice if each job had a unique ID, but I just don't have that data.
I need to somehow limit the JOIN so that it picks off only one "finished" and "reported" entry for each "started" entry: the single entry that has the minimum timestamp greater than the timestamp of the preceding step. I tried doing this by using a subquery as the table to which I was JOINing, but I couldn't figure out how to do it without a correlated subquery. I also tried doing it by using "GROUP BY employee-ID,started", but this didn't necessarily pick the "correct" row. Most of the rows the "GROUP BY" reported were the wrong ones.
So, SQL gurus, is it possible to report just the rows I need to? And if so, how? I am using sqlite3 right now, but could transfer the database to MySQL if needed.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
问题是您如何加入
finish
和report
您不希望
start.timestamp < finish.timestamp
你真的想要start.timestamp < MIN(finish.timestamp)
当然,这不起作用,因此您必须在连接后执行此操作。
例如
,您也可以将开始转换为内部连接,因为没有开始的结束没有多大意义
The problem is how you're joining to
finish
andreport
You don't want the
start.timestamp < finish.timestamp
you really wantstart.timestamp < MIN(finish.timestamp)
Of course that doesn't work so you'll have to do it after the join.
e.g.
Also you could probably convert the start to an inner join as it doesn't make a whole lot of sense to have a finish without a start