利用 Parallel / PLINQ 在所有 Excel 工作表单元格中查找关键字的最佳方法

发布于 2024-12-04 20:53:42 字数 1604 浏览 1 评论 0原文

作为标题,我有一个 List;关键词; 还有一个类似于 Excel 的 Workbook 对象模型。

我想获取与列表中的关键字匹配的所有 WorkbookCell

我在想也许并行搜索会是一个好主意:

            //Loop through all the Worksheets in parallel
            Parallel.ForEach(Workbook.Worksheets, (ws, st) =>
            {
                if (!st.ShouldExitCurrentIteration)
                {
                    //Loop through all the rows in parallel
                    Parallel.ForEach(ws.Rows, (wr, tk) =>
                    {
                        if (!tk.ShouldExitCurrentIteration)
                        {
                            //Loop through all the columns in parallel
                            Parallel.ForEach(wr.Cells, (cell, ctk) =>
                            {
                                if (cell.Value != null)
                                {
                                    var cellValue = cell.Value.ToString();

                                    //Block keyword found, add the occurance
                                    var matchedKeyword = IsKeywordMatched(cellValue);

                                    if (matchedKeyword != null)
                                    {
                                        matchedKeyword.AddMatchedCell(cell);
                                    }
                                }
                            });
                        }
                    });
                }
            });

事实上这会不会太多并行?如果您有更好的想法,请告诉我。

** 正常情况下我的工作表少于 20 个,但每个工作表将包含超过 10000 行和数百列。

As title, I have a List<string> keywords;
and also a Workbook object model that similar to Excel.

I would like to get all the WorkbookCell that matches the keywords in the list.

I was thinking maybe Parallel the searching would be an good idea:

            //Loop through all the Worksheets in parallel
            Parallel.ForEach(Workbook.Worksheets, (ws, st) =>
            {
                if (!st.ShouldExitCurrentIteration)
                {
                    //Loop through all the rows in parallel
                    Parallel.ForEach(ws.Rows, (wr, tk) =>
                    {
                        if (!tk.ShouldExitCurrentIteration)
                        {
                            //Loop through all the columns in parallel
                            Parallel.ForEach(wr.Cells, (cell, ctk) =>
                            {
                                if (cell.Value != null)
                                {
                                    var cellValue = cell.Value.ToString();

                                    //Block keyword found, add the occurance
                                    var matchedKeyword = IsKeywordMatched(cellValue);

                                    if (matchedKeyword != null)
                                    {
                                        matchedKeyword.AddMatchedCell(cell);
                                    }
                                }
                            });
                        }
                    });
                }
            });

Would this be too much of parallel in fact? Please let me know if you have better ideas.

** I have less than 20 worksheets in normal case, but every worksheet will contains more than 10000 of rows and hundreds of columns.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

幽蝶幻影 2024-12-11 20:53:42

默认的并行线程数等于每个默认的核心数。
每个并行循环都与将数据拆分(聚类)为 n 个部分并再次合并它们的开销相关。我想说的是,如果工作表的数量大于常见情况下的核心数量,则仅使用第一个循环是有意义的,否则在第二级拆分数据。嵌套并行循环只会降低性能。因此,是的,你是对的,并行性太多了。

The default number of parallel threads is equal to the number of cores per default.
Each parallel loop is related to the overhead of splitting (clustering) the data into n portions and merging them again. I wold say it makes sense to live only the first loop if the number of worksheets is greater then number of cores in common case, otherwise split data on the second level. Nested parallel loops will only decrease performance. Thus yes, you are right it's too much parallelism.

不必在意 2024-12-11 20:53:42

对我来说,这看起来是并行的一个很好的候选者......

worksheet.Cells.AsParallel().Select(x => new{x,KeywordMatched(x.Value.ToString())}).Where(...)...

应该为您提供几乎线性的性能改进与可用核心数量的比较。

提示:将 IsKeywordMatched 函数更改为 KeywordMatched,该函数返回匹配的字符串,如果没有任何内容,则返回 NULL。然后通过 stinr 不为空的记录过滤结果查询 (.Where(...))。

This looks as a good candidate for paralleling for me...

worksheet.Cells.AsParallel().Select(x => new{x,KeywordMatched(x.Value.ToString())}).Where(...)...

Should give you almost linear performance improvement vs. number of cores available.

HINT: Change your IsKeywordMatched function to KeywordMatched, which returns the string matched or NULL if nothing is there. Then filter the resulting query (.Where(...)) by the records where stinr is not null.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文