利用 Parallel / PLINQ 在所有 Excel 工作表单元格中查找关键字的最佳方法
作为标题,我有一个 List
还有一个类似于 Excel 的 Workbook
对象模型。
我想获取与列表中的关键字匹配的所有 WorkbookCell
。
我在想也许并行搜索会是一个好主意:
//Loop through all the Worksheets in parallel
Parallel.ForEach(Workbook.Worksheets, (ws, st) =>
{
if (!st.ShouldExitCurrentIteration)
{
//Loop through all the rows in parallel
Parallel.ForEach(ws.Rows, (wr, tk) =>
{
if (!tk.ShouldExitCurrentIteration)
{
//Loop through all the columns in parallel
Parallel.ForEach(wr.Cells, (cell, ctk) =>
{
if (cell.Value != null)
{
var cellValue = cell.Value.ToString();
//Block keyword found, add the occurance
var matchedKeyword = IsKeywordMatched(cellValue);
if (matchedKeyword != null)
{
matchedKeyword.AddMatchedCell(cell);
}
}
});
}
});
}
});
事实上这会不会太多并行?如果您有更好的想法,请告诉我。
** 正常情况下我的工作表少于 20 个,但每个工作表将包含超过 10000 行和数百列。
As title, I have a List<string> keywords;
and also a Workbook
object model that similar to Excel.
I would like to get all the WorkbookCell
that matches the keywords in the list.
I was thinking maybe Parallel the searching would be an good idea:
//Loop through all the Worksheets in parallel
Parallel.ForEach(Workbook.Worksheets, (ws, st) =>
{
if (!st.ShouldExitCurrentIteration)
{
//Loop through all the rows in parallel
Parallel.ForEach(ws.Rows, (wr, tk) =>
{
if (!tk.ShouldExitCurrentIteration)
{
//Loop through all the columns in parallel
Parallel.ForEach(wr.Cells, (cell, ctk) =>
{
if (cell.Value != null)
{
var cellValue = cell.Value.ToString();
//Block keyword found, add the occurance
var matchedKeyword = IsKeywordMatched(cellValue);
if (matchedKeyword != null)
{
matchedKeyword.AddMatchedCell(cell);
}
}
});
}
});
}
});
Would this be too much of parallel in fact? Please let me know if you have better ideas.
** I have less than 20 worksheets in normal case, but every worksheet will contains more than 10000 of rows and hundreds of columns.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
默认的并行线程数等于每个默认的核心数。
每个并行循环都与将数据拆分(聚类)为 n 个部分并再次合并它们的开销相关。我想说的是,如果工作表的数量大于常见情况下的核心数量,则仅使用第一个循环是有意义的,否则在第二级拆分数据。嵌套并行循环只会降低性能。因此,是的,你是对的,并行性太多了。
The default number of parallel threads is equal to the number of cores per default.
Each parallel loop is related to the overhead of splitting (clustering) the data into n portions and merging them again. I wold say it makes sense to live only the first loop if the number of worksheets is greater then number of cores in common case, otherwise split data on the second level. Nested parallel loops will only decrease performance. Thus yes, you are right it's too much parallelism.
对我来说,这看起来是并行的一个很好的候选者......
应该为您提供几乎线性的性能改进与可用核心数量的比较。
提示:将 IsKeywordMatched 函数更改为 KeywordMatched,该函数返回匹配的字符串,如果没有任何内容,则返回 NULL。然后通过 stinr 不为空的记录过滤结果查询 (.Where(...))。
This looks as a good candidate for paralleling for me...
Should give you almost linear performance improvement vs. number of cores available.
HINT: Change your IsKeywordMatched function to KeywordMatched, which returns the string matched or NULL if nothing is there. Then filter the resulting query (.Where(...)) by the records where stinr is not null.