在 SQL 中查找连续的日期对

发布于 2024-11-30 22:36:41 字数 1858 浏览 0 评论 0原文

我在这里有一个问题，看起来有点像我在搜索中找到的一些问题，但是针对稍微不同的问题的解决方案，重要的是，那些在 SQL 2000 中不起作用的问题。

我有一个非常大的表，其中有很多我试图将冗余数据减少到仅有用的条目。它是一个历史表，按照它的工作方式，如果两个条目在按日期排序时基本上是重复且连续的，则可以删除后者。当请求该条目的生效日期与下一个非重复条目之间的历史数据时，将使用先前条目的数据。

数据如下所示：

id     user_id effective_date important_value useless_value
1      1       1/3/2007       3               0
2      1       1/4/2007       3               1
3      1       1/6/2007       NULL            1
4      1       2/1/2007       3               0
5      2       1/5/2007       12              1
6      3       1/1/1899       7               0

对于此示例集，如果 user_id 和 important_value 相同，我们将认为两个连续行重复。在此示例集中，我们仅删除 id=2 的行，保留 2007 年 1 月 3 日的信息，显示 important_value 在 2007 年 1 月 6 日发生了更改，然后在 2-1-2007 再次显示相关更改。

我目前的方法既笨拙又耗时，我知道一定有更好的方法。我编写了一个脚本，它使用游标来迭代 user_id 值（因为这会将巨大的表分解为可管理的部分），并创建一个仅包含该用户的行的临时表。然后，为了获取连续的条目，它获取临时表，将其连接到自身，条件是临时表中没有其他条目的日期介于两个日期之间。在下面的伪代码中，UDF_SameOrNull 是一个函数，如果传入的两个值相同或均为 NULL，则返回 1。

WHILE (@@fetch_status <> -1)
BEGIN
  SELECT * FROM History INTO #history WHERE user_id = @UserId

  --return entries to delete
  SELECT h2.id
  INTO #delete_history_ids
  FROM #history h1
  JOIN #history h2 ON
    h1.effective_date < h2.effective_date
    AND dbo.UDF_SameOrNull(h1.important_value, h2.important_value)=1
  WHERE NOT EXISTS (SELECT 1 FROM #history hx WHERE hx.effective_date > h1.effective_date and hx.effective_date < h2.effective_date)

  DELETE h1
  FROM History h1
  JOIN #delete_history_ids dh ON
    h1.id = dh.id 

  FETCH NEXT FROM UserCursor INTO @UserId
END

它还会循环遍历同一组重复项，直到没有重复项为止，因为取出行会创建可能是重复项的新的连续对。为了简单起见，我省略了它。

不幸的是，我必须使用 SQL Server 2000 来完成此任务，并且我非常确定它不支持 ROW_NUMBER() 来以更优雅的方式查找连续条目。

感谢您的阅读。对于伪代码中任何不必要的背景故事或错误，我深表歉意。

原文

I have a question here that looks a little like some of the ones that I found in search, but with solutions for slightly different problems and, importantly, ones that don't work in SQL 2000.

I have a very large table with a lot of redundant data that I am trying to reduce down to just the useful entries. It's a history table, and the way it works, if two entries are essentially duplicates and consecutive when sorted by date, the latter can be deleted. The data from the earlier entry will be used when historical data is requested from a date between the effective date of that entry and the next non-duplicate entry.

The data looks something like this:

id     user_id effective_date important_value useless_value
1      1       1/3/2007       3               0
2      1       1/4/2007       3               1
3      1       1/6/2007       NULL            1
4      1       2/1/2007       3               0
5      2       1/5/2007       12              1
6      3       1/1/1899       7               0

With this sample set, we would consider two consecutive rows duplicates if the user_id and the important_value are the same. From this sample set, we would only delete row with id=2, preserving the information from 1-3-2007, showing that the important_value changed on 1-6-2007, and then showing the relevant change again on 2-1-2007.

My current approach is awkward and time-consuming, and I know there must be a better way. I wrote a script that uses a cursor to iterate through the user_id values (since that breaks the huge table up into manageable pieces), and creates a temp table of just the rows for that user. Then to get consecutive entries, it takes the temp table, joins it to itself on the condition that there are no other entries in the temp table with a date between the two dates. In the pseudocode below, UDF_SameOrNull is a function that returns 1 if the two values passed in are the same or if they are both NULL.

WHILE (@@fetch_status <> -1)
BEGIN
  SELECT * FROM History INTO #history WHERE user_id = @UserId

  --return entries to delete
  SELECT h2.id
  INTO #delete_history_ids
  FROM #history h1
  JOIN #history h2 ON
    h1.effective_date < h2.effective_date
    AND dbo.UDF_SameOrNull(h1.important_value, h2.important_value)=1
  WHERE NOT EXISTS (SELECT 1 FROM #history hx WHERE hx.effective_date > h1.effective_date and hx.effective_date < h2.effective_date)

  DELETE h1
  FROM History h1
  JOIN #delete_history_ids dh ON
    h1.id = dh.id 

  FETCH NEXT FROM UserCursor INTO @UserId
END

It also loops over the same set of duplicates until there are none, since taking out rows creates new consecutive pairs that are potentially dupes. I left that out for simplicity.

Unfortunately, I must use SQL Server 2000 for this task and I am pretty sure that it does not support ROW_NUMBER() for a more elegant way to find consecutive entries.

Thanks for reading. I apologize for any unnecessary backstory or errors in the pseudocode.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

夏至、离别 2024-12-07 22:36:41

好吧，我想我已经弄清楚了，这是一个很好的问题！

首先，我假设 effect_date 列对于 user_id 不会重复。我认为如果情况并非如此，可以对其进行修改以使其工作 - 所以请告诉我我们是否需要考虑这一点。

该过程基本上采用相等的user_id和important_value以及之前的effective_date的值表和自连接。然后，我们在 user_id 上再执行 1 个自联接，通过验证这些记录之间不存在 effective_date 记录，有效地检查上面的 2 个联接记录是否是连续的2 条记录。

现在它只是一个 select 语句 - 它应该选择所有要删除的记录。因此，如果您验证它返回了正确的数据，只需将 select * 更改为 delete tcheck 即可。

如果您有疑问，请告诉我。

select 
    * 
from 
    History tcheck
    inner join History tprev
        on  tprev.[user_id] = tcheck.[user_id]
            and tprev.important_value = tcheck.important_value
            and tprev.effective_date < tcheck.effective_date
    left join History checkbtwn
        on  tcheck.[user_id] = checkbtwn.[user_id]
            and checkbtwn.effective_date < tcheck.effective_date
            and checkbtwn.effective_date > tprev.effective_date
where
    checkbtwn.[user_id] is null

OK, I think I figured this one out, excellent question!

First, I made the assumption that the effective_date column will not be duplicated for a user_id. I think it can be modified to work if that is not the case - so let me know if we need to account for that.

The process basically takes the table of values and self-joins on equal user_id and important_value and prior effective_date. Then, we do 1 more self-join on user_id that effectively checks to see if the 2 joined records above are sequential by verifying that there is no effective_date record that occurs between those 2 records.

It's just a select statement for now - it should select all records that are to be deleted. So if you verify that it is returning the correct data, simply change the select * to delete tcheck.

Let me know if you have questions.

select 
    * 
from 
    History tcheck
    inner join History tprev
        on  tprev.[user_id] = tcheck.[user_id]
            and tprev.important_value = tcheck.important_value
            and tprev.effective_date < tcheck.effective_date
    left join History checkbtwn
        on  tcheck.[user_id] = checkbtwn.[user_id]
            and checkbtwn.effective_date < tcheck.effective_date
            and checkbtwn.effective_date > tprev.effective_date
where
    checkbtwn.[user_id] is null

回复收藏 0 原文

遗弃Ｍ 2024-12-07 22:36:41

好吧，伙计们，我昨晚思考了一些，我想我找到了答案。我希望这对那些必须匹配数据中的连续对并且由于某种原因也陷入 SQL Server 2000 中的人有所帮助。

我受到其他结果的启发，这些结果说使用 ROW_NUMBER()，我使用了非常相似的方法，但有一个身份列。

--create table with identity column
CREATE TABLE #history (
  id int, 
  user_id int, 
  effective_date datetime, 
  important_value int, 
  useless_value int,
  idx int IDENTITY(1,1)
)

--insert rows ordered by effective_date and now indexed in order
INSERT INTO #history
SELECT * FROM History 
WHERE user_id = @user_id
ORDER BY effective_date

--get pairs where consecutive values match
SELECT * 
FROM #history h1
JOIN #history h2 ON
  h1.idx+1 = h2.idx
WHERE h1.important_value = h2.important_value

使用这种方法，我仍然必须迭代结果，直到它不返回任何内容，但我想不出任何解决方法，而且这种方法比我的上一种方法领先了数英里。

OK guys, I did some thinking last night and I think I found the answer. I hope this helps someone else who has to match consecutive pairs in data and for some reason is also stuck in SQL Server 2000.

I was inspired by the other results that say to use ROW_NUMBER(), and I used a very similar approach, but with an identity column.

--create table with identity column
CREATE TABLE #history (
  id int, 
  user_id int, 
  effective_date datetime, 
  important_value int, 
  useless_value int,
  idx int IDENTITY(1,1)
)

--insert rows ordered by effective_date and now indexed in order
INSERT INTO #history
SELECT * FROM History 
WHERE user_id = @user_id
ORDER BY effective_date

--get pairs where consecutive values match
SELECT * 
FROM #history h1
JOIN #history h2 ON
  h1.idx+1 = h2.idx
WHERE h1.important_value = h2.important_value

With this approach, I still have to iterate over the results until it returns nothing, but I can't think of any way around that and this approach is miles ahead of my last one.

回复收藏 0 原文

~没有更多了~