根据行号不包括特定行

发布于 2025-02-11 19:13:14 字数 1830 浏览 2 评论 0原文

我有一个看起来像这样的数据集：

| ReportId  | Method | Status | OrganizationId | StartedAt                     |
|-----------|--------|--------|----------------|-------------------------------|
| 38373bfk8 | Email  | 0      | ABC            | 2022-06-10 00:00:53.794 +0000 |
| 78687fea  | Email  | 0      | XYZ            | 2022-06-10 00:03:51.432 +0000 |
| 48978kd   | Email  | 100    | POD            | 2022-06-10 00:02:45.532 +0000 |
| 38373bfk8 | Email  | 100    | ABC            | 2022-06-10 00:00:22.654 +0000 |
| 86887dhd  | Csv    | 100    | FGH            | 2022-06-10 00:03:12.541 +0000 |
| 78687fea  | Email  | 100    | XYZ            | 2022-06-11 00:04:51.352 +0000 |

在状态列中，0指示失败，100表示为生成组织报告的成功。

我想实现逻辑，如果某个ReportID/Method/andymanID组合失败，但相同的组合在同一天晚些时候具有成功的状态，然后排除该初始失败。我本质上想保留所有成功的行和所有失败行，这些行没有在同一天晚些时候取得成功。

从上面的数据集中，我们将删除第一行，因为当天晚些时候该报告/方法/状态组合取得了成功（第4行）。尽管这也是失败的，我们仍将保持第二行，因为在同一天晚些时候没有成功的状态（第二天第6行）。因此，除第一个外，所有行都将保留。

我已经建立了一个CTE来相应地对行进行排名：

with Ranked as (
    select
        ReportId,
        Method,
        Status,
        OrganizationId,
        StartedAt,
        row_number() over (partition by ReportId, Method, OrganizationId, cast(StartedAt as date) order by StartedAt asc) as rn
    from
        MyTable
)

然后，我根据上述逻辑过滤了我不想要的行：

ExcludeFirstFailures as (
    select
        ReportId,
        Method,
        Status,
        OrganizationId,
        StartedAt,
        rn
    from
        Ranked
    where
        (Status in 0 and rn >  1) --Keep failures that weren't the first of the day
        or Status = 100 --Keep all successful rows
)

这与我需要的内容接近我所需要的，但是问题是它正在滤除 all all /em>失败是第一天的失败，如果同一天晚些时候成功进行了相同的报告，则没有进行比较 - 只有这些失败才能被排除在外。

原文

I have a dataset which looks like this:

| ReportId  | Method | Status | OrganizationId | StartedAt                     |
|-----------|--------|--------|----------------|-------------------------------|
| 38373bfk8 | Email  | 0      | ABC            | 2022-06-10 00:00:53.794 +0000 |
| 78687fea  | Email  | 0      | XYZ            | 2022-06-10 00:03:51.432 +0000 |
| 48978kd   | Email  | 100    | POD            | 2022-06-10 00:02:45.532 +0000 |
| 38373bfk8 | Email  | 100    | ABC            | 2022-06-10 00:00:22.654 +0000 |
| 86887dhd  | Csv    | 100    | FGH            | 2022-06-10 00:03:12.541 +0000 |
| 78687fea  | Email  | 100    | XYZ            | 2022-06-11 00:04:51.352 +0000 |

In the Status column, 0 indicates a failure and 100 indicates a success for the generation of a Report for an Org.

I want to implement logic that if a certain ReportId/Method/OrganizationId combination fails but the same combination has a successful status later in the same day, then exclude that initial failure. I essentially want to keep all successful rows and all failure rows that didn't have a success later in the same day.

From the above dataset, we would remove the first row as there is a success for that ReportId/Method/Status combination later in the day (row 4). We would keep row two despite it also being a failure, because there is no successful status later in the same day (row 6 is the next day). So all rows would be kept except the first.

I have built a CTE to rank the rows based accordingly:

with Ranked as (
    select
        ReportId,
        Method,
        Status,
        OrganizationId,
        StartedAt,
        row_number() over (partition by ReportId, Method, OrganizationId, cast(StartedAt as date) order by StartedAt asc) as rn
    from
        MyTable
)

Then I filter out the rows I don't want based on the above logic:

ExcludeFirstFailures as (
    select
        ReportId,
        Method,
        Status,
        OrganizationId,
        StartedAt,
        rn
    from
        Ranked
    where
        (Status in 0 and rn >  1) --Keep failures that weren't the first of the day
        or Status = 100 --Keep all successful rows
)

This is close to what I need but the problem is that it is filtering out all failures which were the first of the day, without making a comparison if the same report ran ran successfully later on the same day - only those failures should be excluded.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

水波映月 2025-02-18 19:13:14

你可以尝试这个吗？

with successful as (
  select
      ReportId,
      Method,
      Status,
      OrganizationId,
      StartedAt,
      MAX(Status) over (partition by ReportId, Method, OrganizationId, cast(StartedAt as date)) as success
  from
      MyTable
  )
select *
from successful
where
    success = 0  -- all failure rows that didn't have a success later
    or Status = 100 --Keep all successful rows
order by STARTEDAT, REPORTID, status
;

+-----------+--------+--------+----------------+-------------------------------+---------+
| REPORTID  | METHOD | STATUS | ORGANIZATIONID |           STARTEDAT           | SUCCESS |
+-----------+--------+--------+----------------+-------------------------------+---------+
| 38373bfk8 | Email  |    100 | ABC            | 2022-06-10 00:00:22.654 +0000 |     100 |
| 48978kd   | Email  |    100 | POD            | 2022-06-10 00:02:45.532 +0000 |     100 |
| 86887dhd  | Csv    |    100 | FGH            | 2022-06-10 00:03:12.541 +0000 |     100 |
| 78687fea  | Email  |      0 | XYZ            | 2022-06-10 00:03:51.432 +0000 |       0 |
| 78687fea  | Email  |    100 | XYZ            | 2022-06-11 00:04:51.352 +0000 |     100 |
+-----------+--------+--------+----------------+-------------------------------+---------+

78687FEA在2022-06-10取得成功。

额外信息：

我得知状态代码15也被认为是成功的。如果这是成功代码的完整列表，我们仍然可以使用Max方法：而不是使用Max的普通“状态”列，我们可以使用“ Iff（status = 15，'100'，status）”表达式和过滤器对于“（15,100）中的状态”。

为了能够进行测试，我将状态设置为15报告（ID：38373BFK8和48978KD）。

修改后的SQL的输出：

+-----------+--------+--------+----------------+-------------------------------+---------+
| REPORTID  | METHOD | STATUS | ORGANIZATIONID |           STARTEDAT           | SUCCESS |
+-----------+--------+--------+----------------+-------------------------------+---------+
| 38373bfk8 | Email  |     15 | ABC            | 2022-06-10 00:00:22.654 +0000 |     100 |
| 48978kd   | Email  |     15 | POD            | 2022-06-10 00:02:45.532 +0000 |     100 |
| 86887dhd  | Csv    |    100 | FGH            | 2022-06-10 00:03:12.541 +0000 |     100 |
| 78687fea  | Email  |      0 | XYZ            | 2022-06-10 00:03:51.432 +0000 |       0 |
| 78687fea  | Email  |    100 | XYZ            | 2022-06-11 00:04:51.352 +0000 |     100 |
+-----------+--------+--------+----------------+-------------------------------+---------+

Can you try this one?

with successful as (
  select
      ReportId,
      Method,
      Status,
      OrganizationId,
      StartedAt,
      MAX(Status) over (partition by ReportId, Method, OrganizationId, cast(StartedAt as date)) as success
  from
      MyTable
  )
select *
from successful
where
    success = 0  -- all failure rows that didn't have a success later
    or Status = 100 --Keep all successful rows
order by STARTEDAT, REPORTID, status
;

+-----------+--------+--------+----------------+-------------------------------+---------+
| REPORTID  | METHOD | STATUS | ORGANIZATIONID |           STARTEDAT           | SUCCESS |
+-----------+--------+--------+----------------+-------------------------------+---------+
| 38373bfk8 | Email  |    100 | ABC            | 2022-06-10 00:00:22.654 +0000 |     100 |
| 48978kd   | Email  |    100 | POD            | 2022-06-10 00:02:45.532 +0000 |     100 |
| 86887dhd  | Csv    |    100 | FGH            | 2022-06-10 00:03:12.541 +0000 |     100 |
| 78687fea  | Email  |      0 | XYZ            | 2022-06-10 00:03:51.432 +0000 |       0 |
| 78687fea  | Email  |    100 | XYZ            | 2022-06-11 00:04:51.352 +0000 |     100 |
+-----------+--------+--------+----------------+-------------------------------+---------+

78687fea wasn't successful on 2022-06-10.

Extra info:

I'm informed that status code 15 is also counted as successful. If this is the complete list of successful codes, we can still use the MAX approach: Instead of using plain "status" column with MAX, we can use "IFF(Status = 15,'100', Status )" expression, and filter for "Status IN (15,100)".

To be able to test, I set the status to 15 for reports (id: 38373bfk8 and 48978kd).

The output of the modified SQL:

+-----------+--------+--------+----------------+-------------------------------+---------+
| REPORTID  | METHOD | STATUS | ORGANIZATIONID |           STARTEDAT           | SUCCESS |
+-----------+--------+--------+----------------+-------------------------------+---------+
| 38373bfk8 | Email  |     15 | ABC            | 2022-06-10 00:00:22.654 +0000 |     100 |
| 48978kd   | Email  |     15 | POD            | 2022-06-10 00:02:45.532 +0000 |     100 |
| 86887dhd  | Csv    |    100 | FGH            | 2022-06-10 00:03:12.541 +0000 |     100 |
| 78687fea  | Email  |      0 | XYZ            | 2022-06-10 00:03:51.432 +0000 |       0 |
| 78687fea  | Email  |    100 | XYZ            | 2022-06-11 00:04:51.352 +0000 |     100 |
+-----------+--------+--------+----------------+-------------------------------+---------+

回复收藏 0 原文

_失温 2025-02-18 19:13:14

那呢？

with report_rn as (
    select
        ReportId,
        Method,
        Status,
        OrganizationId,
        StartedAt,
        row_number() over (
            partition by ReportId, Method, OrganizationId, cast(StartedAt as date) 
            order by status desc, StartedAt desc
        ) as rn
    from
        report
    qualify rn = 1
)
select
    ReportId,
    Method,
    Status,
    OrganizationId,
    StartedAt
from report_rn 
order by StartedAt;

+-----------+--------+--------+----------------+-------------------------------+
| REPORTID  | METHOD | STATUS | ORGANIZATIONID | STARTEDAT                     |
|-----------+--------+--------+----------------+-------------------------------|
| 38373bfk8 | Email  |    100 | ABC            | 2022-06-10 00:00:22.654000000 |
| 48978kd   | Email  |    100 | POD            | 2022-06-10 00:02:45.532000000 |
| 86887dhd  | Csv    |    100 | FGH            | 2022-06-10 00:03:12.541000000 |
| 78687fea  | Email  |      0 | XYZ            | 2022-06-10 00:03:51.432000000 |
| 78687fea  | Email  |    100 | XYZ            | 2022-06-11 00:04:51.352000000 |
+-----------+--------+--------+----------------+-------------------------------+

如果仅运行一次，则成功或失败的rn = 1。如果在同一天晚些时候成功，因为我们按状态desc订购，状态100将位于顶部，因此将选择它。

如果同一天发生多次失败，那么最新的失败将被返回，因为我们也再次通过startat desc订购。

What about this one?

with report_rn as (
    select
        ReportId,
        Method,
        Status,
        OrganizationId,
        StartedAt,
        row_number() over (
            partition by ReportId, Method, OrganizationId, cast(StartedAt as date) 
            order by status desc, StartedAt desc
        ) as rn
    from
        report
    qualify rn = 1
)
select
    ReportId,
    Method,
    Status,
    OrganizationId,
    StartedAt
from report_rn 
order by StartedAt;

+-----------+--------+--------+----------------+-------------------------------+
| REPORTID  | METHOD | STATUS | ORGANIZATIONID | STARTEDAT                     |
|-----------+--------+--------+----------------+-------------------------------|
| 38373bfk8 | Email  |    100 | ABC            | 2022-06-10 00:00:22.654000000 |
| 48978kd   | Email  |    100 | POD            | 2022-06-10 00:02:45.532000000 |
| 86887dhd  | Csv    |    100 | FGH            | 2022-06-10 00:03:12.541000000 |
| 78687fea  | Email  |      0 | XYZ            | 2022-06-10 00:03:51.432000000 |
| 78687fea  | Email  |    100 | XYZ            | 2022-06-11 00:04:51.352000000 |
+-----------+--------+--------+----------------+-------------------------------+

If it ran only once, either successful or failed will have RN = 1. If there is a successful one later on on the same day, because we order by status desc, status 100 will be on the top, so it will be selected.

If there are multiple failures on the same day, then the latest failed one will be returned because we again order by StartedAt desc as well.

回复收藏 0 原文

~没有更多了~