在 SQL 中执行类似数据透视表的 JOIN

发布于 2024-11-15 23:34:18 字数 2542 浏览 2 评论 0原文

我的雇主有一个批处理计算集群，用于处理用户提交的作业。每个批处理作业由三个步骤组成：

作业开始
作业完成
向用户报告的结果

批处理作业管理软件会记录每个步骤发生的时间，日志文件由一个元组组成，其中包含提交作业的员工的 ID 代码、什么发生的步骤以及发生的时间戳。在 CSV 中，它看起来像：

ID  step    timestamp
--  ------  ---------
A   start   3533
B   start   3538
B   finish  3549
C   start   3551
A   finish  3557
B   report  3559
C   finish  3602
A   report  3603
B   start   3611
C   report  3623
B   finish  3643
B   report  3657

等等。

该数据集的另一个特征是员工之间存在共识，但员工内部没有共识；即，每个员工必须等到当前的工作报告完毕后才能开始下一个工作。因此，当我按日期排序并将结果限制为单个员工时，记录总是按“开始”、“完成”、“报告”的顺序出现。

我想创建一个数据透视表，将每个作业分组为一行。所以上面的数据变成：

employee-ID  started  finished  reported
-----------  -------  --------  --------
A            3533     3557      3603
B            3538     3549      3559
B            3611     3643      3657
C            3551     3602      3623

那么，关于 SQL：

SELECT
    log.ID AS employee-ID,
    start.timestamp AS started,
    finish.timestamp AS finished,
    report.timestamp AS reported
FROM
    log

    LEFT OUTER JOIN log start ON
    log.ID = start.ID
        AND start.step = 'start'

    LEFT OUTER JOIN log finish ON
    log.ID = finish.ID
        AND finish.step = 'finish'
        AND start.timestamp < finish.timestamp

    LEFT OUTER JOIN log report ON
    log.ID = report.ID
        AND report.step = 'report'
        AND finish.timestamp < report.timestamp

ORDER BY employee-ID,started,finished,reported;

我确实需要 LEFT OUTER JOIN，因为我还需要识别已启动但未完成或报告的作业。

这效果非常好。它确实给了我我需要的行。但它给了我很多虚假行，因为除了当前工作之外，JOIN 还匹配同一员工未来工作的 finish 和 report 条目。因此，报告看起来像这样：

employee-ID  started  finished  reported
-----------  -------  --------  --------
A            3533     3557      3603
B            3538     3549      3559
B            3538     3549      3657 <-- spurious
B            3538     3643      3657 <-- spurious
B            3611     3643      3657
C            3551     3602      3623

很容易识别虚假行：每个作业仅启动一次，因此考虑到排序，正确的行是具有唯一“启动”值的第一行。我现在正在应用程序级别通过跳过虚假行来解决虚假行问题，但这似乎不太优雅。而且成本高昂：其中一些员工提交了数十份工作，因此目前我的查询结果大约有 15% 是合法条目，85% 是虚假条目。跳过虚假条目会浪费大量时间。如果每个作业都有一个唯一的 ID，那就太好了，但我只是没有该数据。

我需要以某种方式限制 JOIN，以便它为每个“开始”条目仅挑选一个“完成”和“报告”条目：最小时间戳大于前一步骤时间戳的单个条目。我尝试通过使用子查询作为我要加入的表来执行此操作，但我不知道如何在没有相关子查询的情况下执行此操作。我还尝试使用“GROUP BY employee-ID,started”来执行此操作，但这不一定会选择“正确”的行。 “GROUP BY”报告的大多数行都是错误的。

那么，SQL 专家，是否可以只报告我需要的行？如果是这样，怎么办？我现在正在使用 sqlite3，但如果需要的话可以将数据库传输到 MySQL。

原文

My employer has a batch compute cluster that processes jobs submitted by users. Each batch job consists of three steps:

job started
job finished
results reported to the user

The batch job management software logs when each of these steps happen, and the logfile consists of a tuple with an ID code of the employee who submitted the job, what step occurred, and a timestamp of when it occurred. In CSV, it looks like:

ID  step    timestamp
--  ------  ---------
A   start   3533
B   start   3538
B   finish  3549
C   start   3551
A   finish  3557
B   report  3559
C   finish  3602
A   report  3603
B   start   3611
C   report  3623
B   finish  3643
B   report  3657

And so forth.

One additional characteristic of the dataset is that there is concurrence between employees, but no concurrence within employees; ie, each employee has to wait until their current job has reported before their next job starts. So when I sort by date and limit the results to a single employee, the records always come out in the order "start","finish","report".

I want to create a pivot table that groups each job into a single row. So the above data becomes:

employee-ID  started  finished  reported
-----------  -------  --------  --------
A            3533     3557      3603
B            3538     3549      3559
B            3611     3643      3657
C            3551     3602      3623

So, on to the SQL:

SELECT
    log.ID AS employee-ID,
    start.timestamp AS started,
    finish.timestamp AS finished,
    report.timestamp AS reported
FROM
    log

    LEFT OUTER JOIN log start ON
    log.ID = start.ID
        AND start.step = 'start'

    LEFT OUTER JOIN log finish ON
    log.ID = finish.ID
        AND finish.step = 'finish'
        AND start.timestamp < finish.timestamp

    LEFT OUTER JOIN log report ON
    log.ID = report.ID
        AND report.step = 'report'
        AND finish.timestamp < report.timestamp

ORDER BY employee-ID,started,finished,reported;

I do need LEFT OUTER JOIN, because I also need to identify the jobs that were started but were not finished or reported.

This works pretty well. It does give me the rows that I need. But it gives me a lot of spurious rows, because the JOINs match finish and report entries for future jobs of the same employee in addition to the current job. So the report comes out looking like:

employee-ID  started  finished  reported
-----------  -------  --------  --------
A            3533     3557      3603
B            3538     3549      3559
B            3538     3549      3657 <-- spurious
B            3538     3643      3657 <-- spurious
B            3611     3643      3657
C            3551     3602      3623

It's easy to recognize the spurious rows: each job gets started only once, so given the sorting, the correct row is the first row with a unique "started" value. I'm working around the spurious rows problem right now at the application level by just skipping over the spurious rows, but that just seems, well, inelegant. And it's costly: some of these employees have dozens of jobs submitted, so currently, the results of my queries are about 15% legitimate entries and 85% spurious. That's a lot of wasted time skipping over bogus entries. It would be nice if each job had a unique ID, but I just don't have that data.

I need to somehow limit the JOIN so that it picks off only one "finished" and "reported" entry for each "started" entry: the single entry that has the minimum timestamp greater than the timestamp of the preceding step. I tried doing this by using a subquery as the table to which I was JOINing, but I couldn't figure out how to do it without a correlated subquery. I also tried doing it by using "GROUP BY employee-ID,started", but this didn't necessarily pick the "correct" row. Most of the rows the "GROUP BY" reported were the wrong ones.

So, SQL gurus, is it possible to report just the rows I need to? And if so, how? I am using sqlite3 right now, but could transfer the database to MySQL if needed.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

ㄖ落Θ余辉 2024-11-22 23:34:18

问题是您如何加入 finish 和 report

您不希望 start.timestamp < finish.timestamp 你真的想要 start.timestamp < MIN(finish.timestamp)

当然，这不起作用，因此您必须在连接后执行此操作。

例如

SELECT
    log.ID AS employee_ID,
    start.timestamp AS started,
    MIN(finish.timestamp) AS finished,
    MIN(report.timestamp) AS reported
FROM
    log


LEFT OUTER JOIN log start ON
log.ID = start.ID
    AND start.step = 'start'

LEFT OUTER JOIN log finish ON
log.ID = finish.ID
    AND finish.step = 'finish'
    AND start.timestamp < finish.timestamp

LEFT OUTER JOIN log report ON
log.ID = report.ID
    AND report.step = 'report'
    AND finish.timestamp < report.timestamp

GROUP BY log.ID,
    start.timestamp 
ORDER BY 
    employee_ID,started,finished,reported

，您也可以将开始转换为内部连接，因为没有开始的结束没有多大意义

The problem is how you're joining to finish and report

You don't want the start.timestamp < finish.timestamp you really want start.timestamp < MIN(finish.timestamp)

Of course that doesn't work so you'll have to do it after the join.

e.g.

SELECT
    log.ID AS employee_ID,
    start.timestamp AS started,
    MIN(finish.timestamp) AS finished,
    MIN(report.timestamp) AS reported
FROM
    log


LEFT OUTER JOIN log start ON
log.ID = start.ID
    AND start.step = 'start'

LEFT OUTER JOIN log finish ON
log.ID = finish.ID
    AND finish.step = 'finish'
    AND start.timestamp < finish.timestamp

LEFT OUTER JOIN log report ON
log.ID = report.ID
    AND report.step = 'report'
    AND finish.timestamp < report.timestamp

GROUP BY log.ID,
    start.timestamp 
ORDER BY 
    employee_ID,started,finished,reported

Also you could probably convert the start to an inner join as it doesn't make a whole lot of sense to have a finish without a start

回复收藏 0 原文

~没有更多了~