为什么在此示例中使用序列比使用列表慢得多
背景: 我有一系列连续的带时间戳的数据。 数据序列中有漏洞,有些很大,有些只是一个缺失值。
每当漏洞只是一个缺失值时,我想使用虚拟值修补漏洞(较大的漏洞将被忽略)。
我想使用修补序列的延迟生成,因此我使用 Seq.unfold。
我已经制作了两个版本的方法来修补数据中的漏洞。
第一个使用其中有漏洞的数据序列并生成修补的序列。这就是我想要的,但是当输入序列中的元素数量超过 1000 时,这些方法运行速度非常慢,并且输入序列包含的元素越多,情况就会变得越来越糟。
第二种方法使用有漏洞的数据列表并生成修补的序列,并且运行速度很快。然而,这不是我想要的,因为这会强制在内存中实例化整个输入列表。
我想使用 (序列 -> 序列) 方法而不是 (列表 -> 序列) 方法,以避免同时将整个输入列表存储在内存中。
问题:
1)为什么第一种方法如此慢(随着输入列表的增大,情况逐渐变得更糟) (我怀疑这与使用 Seq.skip 1 重复创建新序列有关,但我不确定)
2)如何在使用输入序列时快速修补数据中的漏洞/strong> 而不是输入列表?
代码:
open System
// Method 1 (Slow)
let insertDummyValuesWhereASingleValueIsMissing1 (timeBetweenContiguousValues : TimeSpan) (values : seq<(DateTime * float)>) =
let sizeOfHolesToPatch = timeBetweenContiguousValues.Add timeBetweenContiguousValues // Only insert dummy-values when the gap is twice the normal
(None, values) |> Seq.unfold (fun (prevValue, restOfValues) ->
if restOfValues |> Seq.isEmpty then
None // Reached the end of the input seq
else
let currentValue = Seq.hd restOfValues
if prevValue.IsNone then
Some(currentValue, (Some(currentValue), Seq.skip 1 restOfValues )) // Only happens to the first item in the seq
else
let currentTime = fst currentValue
let prevTime = fst prevValue.Value
let timeDiffBetweenPrevAndCurrentValue = currentTime.Subtract(prevTime)
if timeDiffBetweenPrevAndCurrentValue = sizeOfHolesToPatch then
let dummyValue = (prevTime.Add timeBetweenContiguousValues, 42.0) // 42 is chosen here for obvious reasons, making this comment superfluous
Some(dummyValue, (Some(dummyValue), restOfValues))
else
Some(currentValue, (Some(currentValue), Seq.skip 1 restOfValues))) // Either the two values were contiguous, or the gap between them was too large to patch
// Method 2 (Fast)
let insertDummyValuesWhereASingleValueIsMissing2 (timeBetweenContiguousValues : TimeSpan) (values : (DateTime * float) list) =
let sizeOfHolesToPatch = timeBetweenContiguousValues.Add timeBetweenContiguousValues // Only insert dummy-values when the gap is twice the normal
(None, values) |> Seq.unfold (fun (prevValue, restOfValues) ->
match restOfValues with
| [] -> None // Reached the end of the input list
| currentValue::restOfValues ->
if prevValue.IsNone then
Some(currentValue, (Some(currentValue), restOfValues )) // Only happens to the first item in the list
else
let currentTime = fst currentValue
let prevTime = fst prevValue.Value
let timeDiffBetweenPrevAndCurrentValue = currentTime.Subtract(prevTime)
if timeDiffBetweenPrevAndCurrentValue = sizeOfHolesToPatch then
let dummyValue = (prevTime.Add timeBetweenContiguousValues, 42.0)
Some(dummyValue, (Some(dummyValue), currentValue::restOfValues))
else
Some(currentValue, (Some(currentValue), restOfValues))) // Either the two values were contiguous, or the gap between them was too large to patch
// Test data
let numbers = {1.0..10000.0}
let contiguousTimeStamps = seq { for n in numbers -> DateTime.Now.AddMinutes(n)}
let dataWithOccationalHoles = Seq.zip contiguousTimeStamps numbers |> Seq.filter (fun (dateTime, num) -> num % 77.0 <> 0.0) // Has a gap in the data every 77 items
let timeBetweenContiguousValues = (new TimeSpan(0,1,0))
// The fast sequence-patching (method 2)
dataWithOccationalHoles |> List.of_seq |> insertDummyValuesWhereASingleValueIsMissing2 timeBetweenContiguousValues |> Seq.iter (fun pair -> printfn "%f %s" (snd pair) ((fst pair).ToString()))
// The SLOOOOOOW sequence-patching (method 1)
dataWithOccationalHoles |> insertDummyValuesWhereASingleValueIsMissing1 timeBetweenContiguousValues |> Seq.iter (fun pair -> printfn "%f %s" (snd pair) ((fst pair).ToString()))
Background:
I have a sequence of contiguous, time-stamped data.
The data-sequence has holes in it, some large, others just a single missing value.
Whenever the hole is just a single missing value, I want to patch the holes using a dummy-value (larger holes will be ignored).
I would like to use lazy generation of the patched sequence, and I am thus using Seq.unfold.
I have made two versions of the method to patch the holes in the data.
The first consumes the sequence of data with holes in it and produces the patched sequence. This is what i want, but the methods runs horribly slow when the number of elements in the input sequence rises above 1000, and it gets progressively worse the more elements the input sequence contains.
The second method consumes a list of the data with holes and produces the patched sequence and it runs fast. This is however not what I want, since this forces the instantiation of the entire input-list in memory.
I would like to use the (sequence -> sequence) method rather than the (list -> sequence) method, to avoid having the entire input-list in memory at the same time.
Questions:
1) Why is the first method so slow (getting progressively worse with larger input lists)
(I am suspecting that it has to do with repeatedly creating new sequences with Seq.skip 1, but I am not sure)
2) How can I make the patching of holes in the data fast, while using an input sequence rather than an input list?
The code:
open System
// Method 1 (Slow)
let insertDummyValuesWhereASingleValueIsMissing1 (timeBetweenContiguousValues : TimeSpan) (values : seq<(DateTime * float)>) =
let sizeOfHolesToPatch = timeBetweenContiguousValues.Add timeBetweenContiguousValues // Only insert dummy-values when the gap is twice the normal
(None, values) |> Seq.unfold (fun (prevValue, restOfValues) ->
if restOfValues |> Seq.isEmpty then
None // Reached the end of the input seq
else
let currentValue = Seq.hd restOfValues
if prevValue.IsNone then
Some(currentValue, (Some(currentValue), Seq.skip 1 restOfValues )) // Only happens to the first item in the seq
else
let currentTime = fst currentValue
let prevTime = fst prevValue.Value
let timeDiffBetweenPrevAndCurrentValue = currentTime.Subtract(prevTime)
if timeDiffBetweenPrevAndCurrentValue = sizeOfHolesToPatch then
let dummyValue = (prevTime.Add timeBetweenContiguousValues, 42.0) // 42 is chosen here for obvious reasons, making this comment superfluous
Some(dummyValue, (Some(dummyValue), restOfValues))
else
Some(currentValue, (Some(currentValue), Seq.skip 1 restOfValues))) // Either the two values were contiguous, or the gap between them was too large to patch
// Method 2 (Fast)
let insertDummyValuesWhereASingleValueIsMissing2 (timeBetweenContiguousValues : TimeSpan) (values : (DateTime * float) list) =
let sizeOfHolesToPatch = timeBetweenContiguousValues.Add timeBetweenContiguousValues // Only insert dummy-values when the gap is twice the normal
(None, values) |> Seq.unfold (fun (prevValue, restOfValues) ->
match restOfValues with
| [] -> None // Reached the end of the input list
| currentValue::restOfValues ->
if prevValue.IsNone then
Some(currentValue, (Some(currentValue), restOfValues )) // Only happens to the first item in the list
else
let currentTime = fst currentValue
let prevTime = fst prevValue.Value
let timeDiffBetweenPrevAndCurrentValue = currentTime.Subtract(prevTime)
if timeDiffBetweenPrevAndCurrentValue = sizeOfHolesToPatch then
let dummyValue = (prevTime.Add timeBetweenContiguousValues, 42.0)
Some(dummyValue, (Some(dummyValue), currentValue::restOfValues))
else
Some(currentValue, (Some(currentValue), restOfValues))) // Either the two values were contiguous, or the gap between them was too large to patch
// Test data
let numbers = {1.0..10000.0}
let contiguousTimeStamps = seq { for n in numbers -> DateTime.Now.AddMinutes(n)}
let dataWithOccationalHoles = Seq.zip contiguousTimeStamps numbers |> Seq.filter (fun (dateTime, num) -> num % 77.0 <> 0.0) // Has a gap in the data every 77 items
let timeBetweenContiguousValues = (new TimeSpan(0,1,0))
// The fast sequence-patching (method 2)
dataWithOccationalHoles |> List.of_seq |> insertDummyValuesWhereASingleValueIsMissing2 timeBetweenContiguousValues |> Seq.iter (fun pair -> printfn "%f %s" (snd pair) ((fst pair).ToString()))
// The SLOOOOOOW sequence-patching (method 1)
dataWithOccationalHoles |> insertDummyValuesWhereASingleValueIsMissing1 timeBetweenContiguousValues |> Seq.iter (fun pair -> printfn "%f %s" (snd pair) ((fst pair).ToString()))
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
每当您使用
Seq.hd
和Seq.skip 1
分解 seq 时,您几乎肯定会陷入 O(N^2) 的陷阱。IEnumerable
对于递归算法(包括例如Seq.unfold
)来说是一个糟糕的类型,因为这些算法几乎总是具有“第一个元素”和“其余元素”的结构元素”,并且没有有效的方法来创建一个新的 IEnumerable 来表示“元素的其余部分”。 (IEnumerator
是可行的,但它的 API 编程模型不太有趣/易于使用。)如果您需要原始数据“保持惰性”,那么您应该使用 LazyList (在 F# PowerPack 中)。如果您不需要惰性,那么您应该使用具体的数据类型,例如“列表”,您可以在 O(1) 中对其进行“拖尾”。
(您还应该查看 避免堆栈溢出(使用 F# 无限序列序列)仅供参考,尽管它只适用于这个问题。)
Any time you break apart a seq using
Seq.hd
andSeq.skip 1
you are almost surely falling into the trap of going O(N^2).IEnumerable<T>
is an awful type for recursive algorithms (including e.g.Seq.unfold
), since these algorithms almost always have the structure of 'first element' and 'remainder of elements', and there is no efficient way to create a newIEnumerable
that represents the 'remainder of elements'. (IEnumerator<T>
is workable, but its API programming model is not so fun/easy to work with.)If you need the original data to 'stay lazy', then you should use a LazyList (in the F# PowerPack). If you don't need the laziness, then you should use a concrete data type like 'list', which you can 'tail' into in O(1).
(You should also check out Avoiding stack overflow (with F# infinite sequences of sequences) as an FYI, though it's only tangentially applicable to this problem.)
Seq.skip 构造一个新序列。我认为这就是为什么你原来的方法很慢。
我的第一个倾向是使用序列表达式和 Seq.pairwise。这是快速且易于阅读的。
Seq.skip constructs a new sequence. I think that is why your original approach is slow.
My first inclination is to use a sequence expression and Seq.pairwise. This is fast and easy to read.