Pairwise sequence processing to compare database tables

Posted 2024-11-01 06:34:48

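Consider the following use case:
I want to iterate over two database tables in parallel and find the differences and the gaps/missing records in either table. Assume that 1) each table's primary key is an Int ID field; 2) the tables are read in ID order; 3) records may be missing from either table (leaving corresponding gaps in the sequence).

I want to do this in a single pass over each database, using lazy reads. (My initial version of this program uses sequence objects and data readers; unfortunately, it makes multiple passes over each database.)

I have considered using pairwise sequence processing with Seq.skip inside the iteration to try to keep the table processing in sync. However, this is apparently very slow, because Seq.skip has a high overhead (it creates new sequences under the hood), so this could be a problem with a large table (say 200k records).

I imagine this is a common design pattern (comparing concurrent data streams from different sources), and I am interested in feedback, comments, or links to similar projects.

Would anyone care to comment?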

血之狂魔 2024-11-08 06:34:48

Here's my (completely untested) take, doing a single pass over both tables:

let findDifferences readerA readerB =
    // Wrap each reader as a lazy list of IDs; rows are pulled only on demand.
    let idsA, idsB =
        let getIds (reader:System.Data.Common.DbDataReader) =
            reader |> LazyList.unfold (fun reader ->
                if reader.Read ()
                then Some (reader.GetInt32 0, reader)
                else None)
        getIds readerA, getIds readerB

    let onlyInA, onlyInB = ResizeArray<_>(), ResizeArray<_>()
    // Walk both ID lists in step, always advancing the side with the smaller ID.
    let rec impl a b =
        let inline handleOnlyInA idA as' = onlyInA.Add idA; impl as' b
        let inline handleOnlyInB idB bs' = onlyInB.Add idB; impl a bs'
        match a, b with
        | LazyList.Cons (idA, as'), LazyList.Cons (idB, bs') ->
            if   idA < idB then handleOnlyInA idA as'   // idA is missing from B
            elif idA > idB then handleOnlyInB idB bs'   // idB is missing from A
            else impl as' bs'                            // same ID in both tables
        | LazyList.Nil, LazyList.Nil  -> () // termination condition
        | LazyList.Cons (idA, as'), _ -> handleOnlyInA idA as'  // B exhausted
        | _, LazyList.Cons (idB, bs') -> handleOnlyInB idB bs'  // A exhausted
    impl idsA idsB
    onlyInA.ToArray (), onlyInB.ToArray ()

This takes two DataReaders (one for each table) and returns two int[] arrays: the IDs present only in the first table and the IDs present only in the second. The code assumes that the ID field is of type int and is at ordinal index 0.

Also note that this code uses LazyList from the F# PowerPack, so you'll need to get that if you don't already have it. If you're targeting .NET 4.0 then I strongly recommend getting the .NET 4.0 binaries which I've built and hosted here, as the binaries from the F# PowerPack site only target .NET 2.0 and sometimes don't play nice with VS2010 SP1 (see this thread for more info: Problem with F# Powerpack. Method not found error).
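If you would rather avoid the PowerPack dependency, the same single-pass comparison can be sketched with plain enumerators from the core library. This is an illustrative variant rather than the code above: `findDifferencesSeq` and `readIds` are hypothetical names, and the sketch assumes both inputs yield IDs in ascending order with the ID in column 0.

```fsharp
// Hypothetical helper: expose a reader's first column as a lazy sequence of IDs.
// The sequence is single-use, because it consumes the underlying reader.
let readIds (reader: System.Data.Common.DbDataReader) : seq<int> =
    seq { while reader.Read() do yield reader.GetInt32 0 }

// Hypothetical helper: diff two ascending ID sequences in one pass over each.
let findDifferencesSeq (idsA: seq<int>) (idsB: seq<int>) =
    use eA = idsA.GetEnumerator()
    use eB = idsB.GetEnumerator()
    let onlyInA, onlyInB = ResizeArray<int>(), ResizeArray<int>()
    // hasA/hasB record whether each enumerator currently points at a valid element.
    let mutable hasA = eA.MoveNext()
    let mutable hasB = eB.MoveNext()
    while hasA || hasB do
        if hasA && (not hasB || eA.Current < eB.Current) then
            // The current ID of A has no counterpart in B.
            onlyInA.Add eA.Current
            hasA <- eA.MoveNext()
        elif hasB && (not hasA || eB.Current < eA.Current) then
            // The current ID of B has no counterpart in A.
            onlyInB.Add eB.Current
            hasB <- eB.MoveNext()
        else
            // Same ID on both sides: advance both.
            hasA <- eA.MoveNext()
            hasB <- eB.MoveNext()
    onlyInA.ToArray(), onlyInB.ToArray()

// In-memory lists stand in for the two tables here.
let onlyA, onlyB = findDifferencesSeq [1; 2; 4; 5; 7] [1; 3; 4; 6; 7; 8]
printfn "only in A: %A, only in B: %A" onlyA onlyB
```

With real tables, the call would look like `findDifferencesSeq (readIds readerA) (readIds readerB)`, which keeps each reader forward-only and reads every row once.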

沉睡月亮 2024-11-08 06:34:48

When you use sequences, any lazy function adds some overhead on the sequence. Calling Seq.skip thousands of times on the same sequence will clearly be slow.
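One way to see the cost is to count how many elements the source actually generates. The sketch below is illustrative only (`produced` and `source` are hypothetical names): it compares reaching each element through a fresh `Seq.skip i` against one ordinary pass. Because a `seq` restarts from the beginning every time it is enumerated, the skip-based pattern regenerates all earlier elements on each step.

```fsharp
// Counts how many elements the source sequence generates in total.
let produced = ref 0

// A lazy source of n IDs that bumps the counter whenever an element is generated.
let source n =
    seq {
        for i in 1 .. n do
            produced.Value <- produced.Value + 1
            yield i
    }

let n = 100
let ids = source n

// Skip-based access: reach element i by skipping i elements each time.
produced.Value <- 0
for i in 0 .. n - 1 do
    ids |> Seq.skip i |> Seq.head |> ignore
let viaSkip = produced.Value

// Single pass: every element is generated exactly once.
produced.Value <- 0
for x in ids do
    ignore x
let viaSinglePass = produced.Value

printfn "Seq.skip: %d elements generated; single pass: %d" viaSkip viaSinglePass
```

For this access pattern the skip-based count grows as n(n+1)/2, while the single pass grows as n. Against a database-backed sequence, each of those restarts could also mean re-reading rows.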

You can use Seq.zip or Seq.map2 to process two sequences at a time:

> Seq.map2 (+) [1..3] [10..12];;
val it : seq<int> = seq [11; 13; 15]
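Note that Seq.zip and Seq.map2 pair elements strictly by position, so they fit only when both inputs contain the same IDs. A small sketch with made-up ID lists shows what happens once one side has a gap:

```fsharp
let idsA = [1; 2; 4; 5]      // ID 3 is missing on this side
let idsB = [1; 2; 3; 4; 5]

// Seq.zip pairs by position and stops when the shorter input ends.
let pairs = Seq.zip idsA idsB |> List.ofSeq

// After the first gap, every later pair is misaligned.
let mismatched = pairs |> List.filter (fun (a, b) -> a <> b)
printfn "%A" mismatched   // [(4, 3); (5, 4)]
```

Every pair after the gap compares unrelated records, which is why tables with missing rows need a walk that advances each side independently.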

If the Seq module is not enough, you might need to write your own function.
I'm not sure I understand what you're trying to do, but this sample function might help you:

// cond1, cond2 and cond3 are placeholders for your own comparison logic.
let fct (s1: seq<_>) (s2: seq<_>) =
    use e1 = s1.GetEnumerator()
    use e2 = s2.GetEnumerator()
    let rec walk () =

        // do some stuff with the element of both sequences
        printfn "%d %d" e1.Current e2.Current

        if cond1 then // move in both sequences
            if e1.MoveNext() && e2.MoveNext() then walk ()
            else () // end of a sequence

        elif cond2 then // move to the next element of s1
            if e1.MoveNext() then walk()
            else () // end of s1

        elif cond3 then // move to the next element of s2
            if e2.MoveNext() then walk ()
            else () // end of s2

    // we need at least one element in each sequence
    if e1.MoveNext() && e2.MoveNext() then walk()

Edit:

The previous function was meant to extend the functionality of the Seq module, and you'll probably want to make it a higher-order function. As ildjarn said, using LazyList can lead to cleaner code:

let rec merge (l1: LazyList<_>) (l2: LazyList<_>) =
    match l1, l2 with
    | LazyList.Cons(h1, t1), LazyList.Cons(h2, t2) ->
        if h1 <= h2 then LazyList.cons h1 (merge t1 l2)
        else LazyList.cons h2 (merge l1 t2)
    | LazyList.Nil, l2 -> l2
    | _ -> l1

merge (LazyList.ofSeq [1; 4; 5; 7]) (LazyList.ofSeq [1; 2; 3; 6; 8; 9])

But I still think you should separate the iteration over your data from the processing. Writing a higher-order function to do the iteration is a good idea (in the end, it doesn't matter much if the iterator function uses mutable enumerators internally).
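A minimal sketch of that separation, using only the core library: the iteration lives in one higher-order function that lazily aligns two key-sorted inputs, and the processing is an ordinary pipeline over its output. The names `Aligned` and `alignBy`, and the sample rows, are hypothetical. The sketch assumes both inputs are sorted ascending by key.

```fsharp
// Hypothetical result type: how one key lines up across the two inputs.
type Aligned<'a, 'b> =
    | Both of 'a * 'b
    | LeftOnly of 'a
    | RightOnly of 'b

// Iteration only: walk two key-sorted inputs lazily, in one pass over each.
let alignBy (keyA: 'a -> 'k) (keyB: 'b -> 'k) (xs: seq<'a>) (ys: seq<'b>) : seq<Aligned<'a, 'b>> =
    seq {
        use ex = xs.GetEnumerator()
        use ey = ys.GetEnumerator()
        // Ref cells track whether each enumerator points at a valid element.
        let hasX = ref (ex.MoveNext())
        let hasY = ref (ey.MoveNext())
        while hasX.Value || hasY.Value do
            if hasX.Value && hasY.Value then
                let x, y = ex.Current, ey.Current
                let c = compare (keyA x) (keyB y)
                if c = 0 then
                    yield Both (x, y)
                    hasX.Value <- ex.MoveNext()
                    hasY.Value <- ey.MoveNext()
                elif c < 0 then
                    yield LeftOnly x
                    hasX.Value <- ex.MoveNext()
                else
                    yield RightOnly y
                    hasY.Value <- ey.MoveNext()
            elif hasX.Value then
                yield LeftOnly ex.Current
                hasX.Value <- ex.MoveNext()
            else
                yield RightOnly ey.Current
                hasY.Value <- ey.MoveNext()
    }

// Sample rows as (id, name) tuples standing in for table records.
let tableA = [ (1, "ann"); (2, "bob"); (4, "dan") ]
let tableB = [ (1, "ann"); (2, "rob"); (3, "cy") ]

// Processing only: decide what each aligned pair means.
let report =
    alignBy fst fst tableA tableB
    |> Seq.choose (function
        | Both (a, b) when a <> b -> Some (sprintf "changed: %d" (fst a))
        | Both _ -> None
        | LeftOnly a -> Some (sprintf "only in A: %d" (fst a))
        | RightOnly b -> Some (sprintf "only in B: %d" (fst b)))
    |> List.ofSeq

printfn "%A" report
```

Because `alignBy` returns a lazy sequence, the caller can report missing rows, compare field values on `Both`, or stop early, all without touching the enumerator logic.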
