Fastest way to split overlapping date ranges
I have date range data in an SQL DB table with these three (only relevant) columns:
- ID (int identity)
- RangeFrom (date only)
- RangeTo (date only)
For any given date range, there may be an arbitrary number of records that may overlap (completely or partially).
Conditions
- Every record with a higher ID (a newer record) takes precedence over older records that it may overlap (fully or partially)
- Ranges are at least 1 day long (RangeFrom and RangeTo differ by at least one day)
So for a given date range (not longer than, say, 5 years) I have to:
- get all range records that fall into this range (either fully or partially)
- split these overlaps into non-overlapping ranges
- return these new non-overlapping ranges
My take on it
Since there's a lot of complex data related to these ranges (lots of joins etc.) and since processor + memory power is much more efficient than the SQL DB engine, I decided to load the overlapping data from the DB into my data layer and do the range chopping/splitting in memory. This gives me much more flexibility as well as speed, in terms of both development and execution.
If you think this should be better handled in DB let me know.
Question
I would like to write the fastest and, if at all possible, least resource-hungry conversion algorithm. Since I get lots of these records, and they are related to various users, I have to run this algorithm for each user and their set of overlapping range data.
What would be the most efficient (fast and resource-light) way of splitting these overlapping ranges?
Example data
I have records ID=1 to ID=7 that overlap visually in this manner (the actual dates are irrelevant; I can show the overlaps better this way):
6666666666666
44444444444444444444444444 5555555555
2222222222222 333333333333333333333 7777777
11111111111111111111111111111111111111111111111111111111111111111111
Result should look like:
111111166666666666664444444444444444444444333333333555555555511111117777777
The result looks as if we viewed these overlaps from the top and then took the IDs we see in this top-down view.
The result will actually get transformed into new range records, so the old IDs become irrelevant, but their RangeFrom and RangeTo values (along with all related data) will be used:
111111122222222222223333333333333333333333444444444555555555566666667777777
This is of course just one example of overlapping ranges. For any given date range there can be anywhere from 0 to X records. And as we can see, range ID=2 got completely overwritten by 4 and 6, so it became completely obsolete.
Answers (4)
I'm not quite sure how useful this will be, but the way I would approach it...
(first unoptimized, for easier understanding...)
(Sort by date; each date, no matter whether it is a start or an end, begins a time range that reaches until the next date.)
So your table would look like:
Since you will have duplicate IDs in your final database ("1" gets split over two segments in your example), keeping the database in date->ID format, as opposed to ID->range, seems preferable in the end.
Now for the obvious optimizations: of course, don't keep a list of IDs with each date record. Just fill in a date->ID table with null IDs and, while filling it in with the final records, replace each value with the largest record ID found so far:
Adding a new record is just like one more iteration of the creation operation. Removing a record, on the other hand, seems quite tricky.
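A minimal Python sketch of this date->ID idea (my own illustration, not the answerer's code; the sample records are made up): collect every boundary date, and for each elementary segment between consecutive boundaries keep the highest covering ID, since newer records take precedence.

```python
from datetime import date

# Hypothetical sample records: (id, range_from, range_to); RangeTo treated as exclusive.
records = [
    (1, date(2023, 1, 1), date(2023, 3, 1)),
    (2, date(2023, 1, 10), date(2023, 1, 20)),
    (3, date(2023, 2, 1), date(2023, 2, 15)),
]

# Every start/end date begins a new elementary segment on the timeline.
boundaries = sorted({d for _, f, t in records for d in (f, t)})

segments = []  # (seg_from, seg_to, winning_id)
for seg_from, seg_to in zip(boundaries, boundaries[1:]):
    # The highest ID covering this segment wins (newer records take precedence).
    covering = [rid for rid, f, t in records if f <= seg_from and t >= seg_to]
    if covering:
        segments.append((seg_from, seg_to, max(covering)))

# Merge adjacent segments that resolved to the same ID back into ranges.
merged = []
for seg in segments:
    if merged and merged[-1][2] == seg[2] and merged[-1][1] == seg[0]:
        merged[-1] = (merged[-1][0], seg[1], seg[2])
    else:
        merged.append(seg)
```

With per-user record sets this small, the O(n) scan per segment is harmless; for larger sets, an interval sweep would avoid rescanning all records.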
Effectively you want to stack the data and select the maximum from the stack. I have had to implement something similar before; the approach we used gave us a bit more flexibility than you require, so it may not be appropriate, but it was this:
Have an object for managing the records, and add each record to this object. When a record is added, create a new date range and associate the record's value with that range. Then check whether the range overlaps with any other existing range. If it does overlap, create a new range for each overlap and associate with it all of the values on both/all of the overlapped ranges (depending on whether you do this as you add each range, or in a single pass). This can be done either as you add the data, or in a single pass once all the data has been added.
At the end you have an object which contains unique ranges, each of which has a collection of values associated with it, a bit like your picture above.
You can then provide a class with a flattening function (probably using the strategy pattern) which converts the unique ranges with collections of values into unique ranges with a single value; this will obviously concatenate ranges which end up with the same value.
You would want a class which selects the maximum value from each unique range, but you might also want to select the minimum value, sum the values, average them, count them, etc. Each of these options can be done by passing a different implementation of the strategy.
As I said, this approach might be less efficient than one which only selects the maximum value, since in that case you wouldn't need to keep all the values in the stack, but as I remember the implementation was fairly straightforward.
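A rough sketch of the pluggable-strategy idea (my illustration; all names and numbers are made up): each unique non-overlapping segment keeps the collection of IDs covering it, and a strategy function collapses that collection to a single value.

```python
# Each unique (non-overlapping) segment keeps all values that cover it;
# spans are day offsets here, purely for illustration.
segments = [
    ((0, 7), [1]),
    ((7, 20), [1, 2, 6]),
    ((20, 42), [1, 2, 4]),
]

def flatten(segments, strategy):
    # Collapse each segment's value collection into one value via the strategy.
    return [(span, strategy(values)) for span, values in segments]

# max == "newest record wins"; min, sum, len plug in the same way.
flattened = flatten(segments, max)
```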
How about an array of nullable integers
I've come up with an idea of my own:
- for the given date range, create an in-memory array of integers with as many items as there are days in the range
- fill the array with null values, all of them
- order the records by ID in reverse order
- flatten the overlapped ranges by iterating over the ordered records and, for each record, setting its ID on every day of its range whose array slot is still null
- you end up with an array of flattened ranges, filled with record IDs
- create a new set of records, starting a new record each time the ID in the array changes; each record should use the data associated with the record ID set in the array
- repeat the whole thing for the next person and their set of overlapped ranges (don't forget to reuse the same array) = go back to step 2
And that's it basically.
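The steps above can be sketched like this (an illustrative Python translation of the algorithm; the day offsets stand in for dates, and the sample records are made up so that they reproduce the question's picture):

```python
# Records are (id, from_day, to_day) with to_day exclusive; day offsets stand in for dates.
records = [
    (1, 0, 68),
    (2, 12, 25),
    (3, 33, 54),
    (4, 16, 42),
    (5, 51, 61),
    (6, 7, 20),
    (7, 68, 75),
]

total_days = 75
days = [None] * total_days  # steps 1+2: the array of "nullable ints"

# Steps 3+4: newest records first; only still-null days get written,
# so older records can never overwrite newer ones.
for rid, d_from, d_to in sorted(records, reverse=True):
    for d in range(d_from, d_to):
        if days[d] is None:
            days[d] = rid

# Steps 5+6: cut the array into non-overlapping ranges wherever the ID changes.
result = []
for d, rid in enumerate(days):
    if rid is None:
        continue
    if result and result[-1][2] == rid and result[-1][1] == d:
        result[-1] = (result[-1][0], d + 1, rid)
    else:
        result.append((d, d + 1, rid))
```

With these sample offsets the output is the ID sequence 1, 6, 4, 3, 5, 1, 7, matching the top-down view in the question; note that ID=2 never appears, exactly as described.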
A 10-year date range requires an array of approx. 3650 nullable integers, which I think is a rather small memory footprint (each integer takes 4 bytes; I don't know exactly how much space a nullable integer, which holds an int and a bool, occupies, but let's assume 8 bytes, which totals 3650*8 = 28.52 KB), and it can be manipulated in memory easily and rather quickly. Since I'm not saving date ranges, splitting, or anything similar, these are barely more than assignment operations, with an if that checks whether the value has already been set. A 10-year date range is a rare, exaggerated extreme: 75% of date ranges will be within 3 months or a quarter of a year (90 days * 8 bytes = 720 bytes), and 99% will fall within a whole year (365*8 = 2920 bytes = 2.85 KB).
I find this algorithm more than appropriate for flattening overlapped date ranges.
To halve the memory footprint, I could use int instead of int? and set -1 instead of null.
A premature loop-break possibility
I could also keep a count of days that aren't set yet; when it reaches 0, I can break out of the loop early, because all the remaining ranges are fully overlapped and hence wouldn't set any more values in the array. This would even speed things up a bit when there are lots of range records (which will be rather rare).
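That early break can be sketched like this (my illustration; day offsets stand in for dates, records are hypothetical (id, from_day, to_day) tuples with the end exclusive):

```python
records = [(1, 0, 30), (2, 5, 15), (3, 0, 30)]  # made-up data; ID 3 is newest
total_days = 30
days = [None] * total_days
unset = total_days  # count of days not yet claimed by any record

for rid, d_from, d_to in sorted(records, reverse=True):
    for d in range(d_from, d_to):
        if days[d] is None:
            days[d] = rid
            unset -= 1
    if unset == 0:
        break  # every remaining (older) record is fully overlapped

# Here ID 3 claims all 30 days, so records 2 and 1 are never even visited.
```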
The free Time Period Library for .NET includes the tool TimePeriodIntersector, which intersects various overlapping time ranges.
The algorithm uses a timeline and enumerates all moments within a time range, counting the start/end points per moment. The ID mapping should be an easy task.
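The library's own sample code is not reproduced here, but the timeline idea can be sketched generically (this is my Python illustration of boundary counting, not the library's API):

```python
# Sweep a timeline: +1 at each range start, -1 at each range end.
# Wherever the running count changes, a new non-overlapping segment begins.
ranges = [(0, 10), (5, 15), (12, 20)]  # illustrative (from, to) pairs

events = {}
for start, end in ranges:
    events[start] = events.get(start, 0) + 1
    events[end] = events.get(end, 0) - 1

boundaries = sorted(events)
segments = []
count = 0
for point, nxt in zip(boundaries, boundaries[1:]):
    count += events[point]
    if count > 0:
        segments.append((point, nxt, count))  # count = ranges overlapping here
```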