How can I optimize this statistics-counting sequence, and why does it run so slowly?

Posted 2024-11-06 07:03:53


Intro: I spent a whole day looking into why my processing operation is so slow. It was really slow even on small data. I checked the SQL views, procedures, and LINQ logic - all of them worked perfectly. But then I saw that this one little thing takes ages to process.

member X.CountStatistics() =
    linq.TrueIncidents
    |> PSeq.groupBy (fun v -> v.Name)
    |> PSeq.map (fun (k, vs) -> k, PSeq.length vs)
    |> Array.ofSeq

It simply counts grouped values, but look how much time it takes! About 10 seconds on a small table.

There must be some angry recursion in there, but I can't see it...

How can I make this operation "a bit faster", or recode it as LINQ-to-SQL?
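If the data has to stay on the client, one likely speedup is to drop the PSeq pipeline entirely: counting per name is too cheap an operation to pay for parallelism, and Seq.countBy does the grouping and counting in a single sequential pass. A minimal sketch (the string list stands in for the Name values of TrueIncidents):

```fsharp
// Sketch: count occurrences per key in one sequential pass.
// Seq.countBy builds the counts directly, avoiding the intermediate
// per-group sequences that groupBy + length would allocate.
let countStatistics (names: seq<string>) =
    names
    |> Seq.countBy id
    |> Array.ofSeq

let demo = countStatistics [ "A"; "B"; "A"; "A"; "C" ]
// keys appear in order of first occurrence, e.g. [|("A", 3); ("B", 1); ("C", 1)|]
```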


3 Answers

噩梦成真你也成魔 2024-11-13 07:03:53


If I understand correctly, TrueIncidents is a table in a database, and you're pulling its entire contents into the client app to do some grouping and counting. If TrueIncidents is a large table then this operation is always going to be slow, since you're moving a large amount of data around. The "correct" way to do this is on the database, as you suggest using LINQ to SQL, or as Tomas suggests using a stored procedure.
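As a sketch of the database-side approach: with F# 3.0+ query expressions (which postdate this answer), the same grouping can be written so that a LINQ provider translates it into a single `SELECT Name, COUNT(*) ... GROUP BY Name`. Here it runs over an in-memory list for illustration; against the question's linq.TrueIncidents the shape would be identical, with the counting done by the server:

```fsharp
open System.Linq  // for the Count() extension on groups

// Sketch: express the grouping as a query, which a LINQ-to-SQL
// provider can push down to the database as a GROUP BY.
let countStatisticsQuery (incidents: seq<string>) =
    query {
        for name in incidents do
        groupBy name into g
        select (g.Key, g.Count())
    }
    |> Array.ofSeq
```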

Regarding PSeq, I don't think inlining will make much of a difference. Parallelization has an overhead, and for this overhead to amortize, the list needs to be relatively large and the operation you perform on each item needs to be significant. Parallelizing may be worth it for a small list if the operation you're performing on each item is very expensive; however, the reverse is not true: even if a list is very large, parallelizing a small operation will not be worth the overhead. So the problem in this case is that the operation you perform on each item in the list is too small, and the cost of the parallelization will always make the operation slower. To see this, consider the following C# program, where we perform a simple addition on a list with 10 million items; you'll see that the parallel version always runs slower (well, on the machine I'm working on at the moment, which has two cores - I guess on a machine with more cores the result might be different).

    // Requires: using System; using System.Collections.Generic;
    //           using System.Diagnostics; using System.Linq;
    static void Main(string[] args)
    {
        var list = new List<int>();
        for (int i = 0; i < 10000000; i++)
        {
            list.Add(i);
        }

        var stopwatch = new Stopwatch();
        stopwatch.Start();
        var res1 = list.Select(x => x + 1);
        foreach (var i in res1)
        {
        }
        stopwatch.Stop();
        Console.WriteLine(stopwatch.Elapsed);
        // 00:00:00.1950918 sec on my machine

        // Restart, not Start: Start alone would resume the first timing
        // and inflate the second measurement.
        stopwatch.Restart();
        // AsParallel must come before Select; otherwise the Select still
        // runs sequentially and only the consumption is parallelized.
        var res2 = list.AsParallel().Select(x => x + 1);
        foreach (var i in res2)
        {
        }
        stopwatch.Stop();
        Console.WriteLine(stopwatch.Elapsed);
        // 00:00:00.3748103 sec on my machine
    }
蓝咒 2024-11-13 07:03:53


The current version of the F# LINQ support is a bit limited.

I think the best way to write this is to sacrifice some of the elegance of using F# and write it as a stored procedure in SQL. Then you can add the stored procedure to your LINQ data context and call it nicely through a generated method. When F# LINQ improves a bit in the future, you can change it back :-).

Regarding the PSeq example - as far as I know, there were some efficiency issues because the methods were not inlined (thanks to inlining, the compiler is able to do some additional optimization, which removes some overhead). You can try downloading the source and adding inline to map and groupBy.

一身仙ぐ女味 2024-11-13 07:03:53


As already mentioned in the other answers, if you bring a large amount of data over from the database and then do some calculations on this large data set, it will be expensive (I think the IO part will be more expensive than the computation part). In your specific case it seems that you want the count for each incident name. One approach is to use F# LINQ-to-SQL to bring over just the "names" of the incidents (no other columns, since you don't need them) and then do the group-by and mapping operations in F#. It may help you improve performance, but I'm not sure how big the improvement will be.
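A sketch of that idea - the pure counting part is testable on its own, while the commented fragment shows where the names would come from (linq and TrueIncidents are the question's data context and table, so that part is hypothetical):

```fsharp
// Sketch: pull only the Name column across the wire, then count in memory.
// Materializing one string column moves far less data than the full rows.
let countNames (names: seq<string>) =
    names |> Seq.countBy id |> Array.ofSeq

// Hypothetical usage against the question's data context:
// let counts =
//     linq.TrueIncidents
//     |> Seq.map (fun v -> v.Name)  // note: still enumerates every row
//     |> countNames
```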
