Finding consecutive date pairs in SQL
I have a question here that looks a little like some of the ones that I found in search, but with solutions for slightly different problems and, importantly, ones that don't work in SQL 2000.
I have a very large table with a lot of redundant data that I am trying to reduce down to just the useful entries. It's a history table, and the way it works is that if two entries are essentially duplicates and consecutive when sorted by date, the latter can be deleted. The data from the earlier entry will be used when historical data is requested for a date between that entry's effective date and the next non-duplicate entry.
The data looks something like this:
id   user_id   effective_date   important_value   useless_value
1    1         1/3/2007         3                 0
2    1         1/4/2007         3                 1
3    1         1/6/2007         NULL              1
4    1         2/1/2007         3                 0
5    2         1/5/2007         12                1
6    3         1/1/1899         7                 0
With this sample set, we would consider two consecutive rows duplicates if the user_id and the important_value are the same. From this sample set, we would only delete the row with id=2, preserving the information from 1/3/2007, showing that the important_value changed on 1/6/2007, and then showing the relevant change again on 2/1/2007.
My current approach is awkward and time-consuming, and I know there must be a better way. I wrote a script that uses a cursor to iterate through the user_id values (since that breaks the huge table up into manageable pieces) and creates a temp table of just the rows for that user. Then, to get consecutive entries, it joins the temp table to itself on the condition that there are no other entries in the temp table with a date between the two dates. In the pseudocode below, UDF_SameOrNull is a function that returns 1 if the two values passed in are the same or if they are both NULL.
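For illustration, a minimal sketch of such a function, with an assumed int parameter type, might look like this:

-- Hypothetical sketch of UDF_SameOrNull: returns 1 when both inputs are equal
-- or both are NULL, otherwise 0. The int parameter type is an assumption.
CREATE FUNCTION dbo.UDF_SameOrNull (@a int, @b int)
RETURNS bit
AS
BEGIN
    IF (@a = @b) OR (@a IS NULL AND @b IS NULL)
        RETURN 1
    RETURN 0
END

The script's main loop looks like this: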
WHILE (@@FETCH_STATUS <> -1)
BEGIN
    -- work on one user's rows at a time
    SELECT * INTO #history FROM History WHERE user_id = @UserId

    -- collect the later row of every consecutive pair with a matching important_value
    SELECT h2.id
    INTO #delete_history_ids
    FROM #history h1
    JOIN #history h2 ON
        h1.effective_date < h2.effective_date
        AND dbo.UDF_SameOrNull(h1.important_value, h2.important_value) = 1
    WHERE NOT EXISTS (SELECT 1 FROM #history hx
                      WHERE hx.effective_date > h1.effective_date
                        AND hx.effective_date < h2.effective_date)

    -- remove those rows from the real table
    DELETE h1
    FROM History h1
    JOIN #delete_history_ids dh ON h1.id = dh.id

    -- drop the temp tables so SELECT INTO works on the next iteration
    DROP TABLE #history
    DROP TABLE #delete_history_ids

    FETCH NEXT FROM UserCursor INTO @UserId
END
It also loops over the same set of duplicates until there are none, since taking out rows creates new consecutive pairs that are potentially dupes. I left that out for simplicity.
Unfortunately, I must use SQL Server 2000 for this task, and it does not support ROW_NUMBER(), which would have offered a more elegant way to find consecutive entries.
Thanks for reading. I apologize for any unnecessary backstory or errors in the pseudocode.
2 Answers
OK, I think I figured this one out, excellent question!

First, I made the assumption that the effective_date column will not be duplicated for a user_id. I think it can be modified to work if that is not the case, so let me know if we need to account for that.

The process basically takes the table of values and self-joins on equal user_id and important_value and a prior effective_date. Then we do one more self-join on user_id that effectively checks whether the two joined records above are sequential, by verifying that there is no record with an effective_date between those two records.

It's just a select statement for now; it should select all records that are to be deleted. So if you verify that it is returning the correct data, simply change the select * to delete tcheck. Let me know if you have questions.
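The query itself is not reproduced in this copy of the answer, so the following is only a sketch of the kind of statement the description implies; the aliases (tcheck for the later, deletable row, plus tprev and tmid) and the NULL-equality condition mirroring the question's UDF_SameOrNull are reconstructions, not the answerer's original code:

-- Sketch of the described self-join approach. tcheck is the later row of a
-- consecutive duplicate pair, i.e. the row that would be deleted.
SELECT *
FROM History tcheck
JOIN History tprev
    ON  tprev.user_id = tcheck.user_id
    AND tprev.effective_date < tcheck.effective_date
    AND (   tprev.important_value = tcheck.important_value
         OR (tprev.important_value IS NULL AND tcheck.important_value IS NULL))
LEFT JOIN History tmid
    ON  tmid.user_id = tcheck.user_id
    AND tmid.effective_date > tprev.effective_date
    AND tmid.effective_date < tcheck.effective_date
WHERE tmid.id IS NULL   -- no row falls between tprev and tcheck, so they are consecutive

Once the result set checks out, changing SELECT * to DELETE tcheck would perform the cleanup described above.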
OK guys, I did some thinking last night and I think I found the answer. I hope this helps someone else who has to match consecutive pairs in data and for some reason is also stuck on SQL Server 2000.

I was inspired by the other results that say to use ROW_NUMBER(), and I used a very similar approach, but with an identity column. With this approach I still have to iterate over the results until it returns nothing, but I can't think of any way around that, and this approach is miles ahead of my last one.
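The answer's code is also missing from this copy; a minimal sketch of the identity-column idea, with illustrative temp-table and column names and assumed data types, could look like this:

-- SQL Server 2000 lacks ROW_NUMBER(), so number the rows with an IDENTITY
-- column in a temp table instead (names and types here are assumptions).
CREATE TABLE #ordered (
    row_num         int IDENTITY(1, 1),
    id              int,
    user_id         int,
    effective_date  datetime,
    important_value int
)

-- INSERT ... SELECT ... ORDER BY makes the identity values follow the sort order.
INSERT INTO #ordered (id, user_id, effective_date, important_value)
SELECT id, user_id, effective_date, important_value
FROM History
ORDER BY user_id, effective_date

-- A row whose immediate predecessor for the same user carries the same
-- important_value (or both are NULL) is a consecutive duplicate.
SELECT cur.id
FROM #ordered cur
JOIN #ordered prev
    ON  prev.row_num = cur.row_num - 1
    AND prev.user_id = cur.user_id
WHERE prev.important_value = cur.important_value
   OR (prev.important_value IS NULL AND cur.important_value IS NULL)

After deleting those ids from History, the temp table would be rebuilt and the pass repeated until the select returns no rows, matching the iteration described above.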