C# multithreaded file reading and page parsing
I have a file with more than 500,000 URLs. I want to read the file and parse every URL with my function, which returns a string message. Everything works for now, but performance is poor, so I need to run the parsing on simultaneous threads (for example, 100 threads):
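ParseEngine parseEngine = new ParserEngine(parseFormulas);
// using disposes the reader even if Parse throws.
using (StreamReader reader = new StreamReader("urls.txt"))
{
    string line;
    // Read the file one line at a time until the end is reached.
    while ((line = reader.ReadLine()) != null)
    {
        // Each URL is parsed sequentially on the calling thread.
        string result = parseEngine.Parse(line);
        Console.WriteLine(result);
    }
}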
It would be good if I could stop all the threads by clicking a button and change the number of threads. Any help or tips?
Comments (4)
Be sure to check out this article on PLINQ performance compared to other techniques for parsing a text file line by line using multithreading.
Not only does it provide sample source code for doing something almost identical to what you want, but the authors also discovered a "gotcha" with PLINQ that can result in abnormally slow times. In a nutshell, if you feed PLINQ a lazily enumerated source such as File.ReadLines() or a StreamReader.ReadLine() loop, you'll spoil the performance because PLINQ can't properly divide the file up that way. They solved the problem by reading all the lines into an indexed array and THEN processing it with PLINQ.
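A minimal sketch of the "array first, then PLINQ" idea described above, not the article's actual code. The Parse method is a hypothetical stand-in for the asker's parseEngine.Parse, and the sample URLs and the degree of parallelism are assumptions chosen only for illustration.

```csharp
using System;
using System.Linq;

public static class PlinqSketch
{
    // Hypothetical stand-in for parseEngine.Parse(line) from the question.
    public static string Parse(string url)
    {
        return "parsed: " + url;
    }

    // Takes an already materialized array so PLINQ can split it by index,
    // then parses the elements in parallel. AsOrdered keeps input order.
    public static string[] ParseAll(string[] lines, int degreeOfParallelism)
    {
        return lines
            .AsParallel()
            .AsOrdered()
            .WithDegreeOfParallelism(degreeOfParallelism)
            .Select(line => Parse(line))
            .ToArray();
    }

    public static void Main()
    {
        // Assumption: in the real program this array would hold every line of urls.txt.
        string[] lines = { "http://a.example", "http://b.example", "http://c.example" };
        foreach (string result in ParseAll(lines, 4))
        {
            Console.WriteLine(result);
        }
    }
}
```

The trade-off is memory: the whole file has to fit in an array before any parsing starts, which is usually acceptable for 500,000 short lines.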
Honestly, for the performance difference I would just try Parallel.ForEach in .NET 4.0, if that is an option.
It's a decent start to running things in parallel and should minimize your thread troubleshooting headaches.
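As a rough sketch of how Parallel.ForEach could also cover the question's two wishes (a configurable thread count and a stop button), ParallelOptions carries both a MaxDegreeOfParallelism and a CancellationToken. Parse is a hypothetical stand-in for the asker's parseEngine.Parse, and the idea of a button handler calling Cancel() is an assumption about the surrounding UI.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class ParallelForEachSketch
{
    // Hypothetical stand-in for parseEngine.Parse(line) from the question.
    public static string Parse(string url)
    {
        return "parsed: " + url;
    }

    // Parses the URLs with at most maxThreads concurrent workers.
    // Returns whatever finished before cancellation, in no particular order.
    public static List<string> ParseAll(IEnumerable<string> urls, int maxThreads, CancellationToken token)
    {
        var results = new ConcurrentBag<string>();
        var options = new ParallelOptions
        {
            MaxDegreeOfParallelism = maxThreads,
            CancellationToken = token
        };
        try
        {
            Parallel.ForEach(urls, options, url => results.Add(Parse(url)));
        }
        catch (OperationCanceledException)
        {
            // The token was cancelled; keep the partial results.
        }
        return new List<string>(results);
    }

    public static void Main()
    {
        using (var cts = new CancellationTokenSource())
        {
            // Assumption: a Stop button's click handler would call cts.Cancel().
            string[] urls = { "http://a.example", "http://b.example" };
            foreach (string result in ParseAll(urls, 4, cts.Token))
            {
                Console.WriteLine(result);
            }
        }
    }
}
```

Cancellation here is cooperative: items already being parsed run to completion, and no new items are started once the token is cancelled.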
A producer/consumer setup would work well here: one thread reads from the file and writes to a queue, and the other threads read from that queue.
You mentioned an example of 100 threads. With that many threads, you would want to read from the queue in batches, since you'd probably have to lock the queue before each read: a plain Queue<T> is not safe to read while another thread is modifying it.
.NET 4.0 adds a generic ConcurrentQueue<T> (in System.Collections.Concurrent) that is safe to use from multiple threads without explicit locking.
You really only want one reader on the file.
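A minimal sketch of that producer/consumer layout, assuming BlockingCollection<T> (also in System.Collections.Concurrent since .NET 4.0) as the queue rather than a hand-locked Queue<T>. Parse is a hypothetical stand-in for parseEngine.Parse, and the bounded capacity of 1000 and the sample input are arbitrary assumptions.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ProducerConsumerSketch
{
    // Hypothetical stand-in for parseEngine.Parse(line) from the question.
    public static string Parse(string url)
    {
        return "parsed: " + url;
    }

    // One producer feeds the queue; consumerCount workers drain it.
    public static List<string> Run(IEnumerable<string> lines, int consumerCount)
    {
        var results = new ConcurrentQueue<string>();
        // The bound makes the producer wait when consumers fall behind.
        using (var queue = new BlockingCollection<string>(1000))
        {
            Task producer = Task.Factory.StartNew(() =>
            {
                try
                {
                    foreach (string line in lines) queue.Add(line);
                }
                finally
                {
                    // Tells consumers that no more items are coming.
                    queue.CompleteAdding();
                }
            }, TaskCreationOptions.LongRunning);

            var consumers = new Task[consumerCount];
            for (int i = 0; i < consumerCount; i++)
            {
                consumers[i] = Task.Factory.StartNew(() =>
                {
                    // Blocks until an item is available; ends after CompleteAdding.
                    foreach (string url in queue.GetConsumingEnumerable())
                    {
                        results.Enqueue(Parse(url));
                    }
                }, TaskCreationOptions.LongRunning);
            }

            producer.Wait();
            Task.WaitAll(consumers);
        }
        return new List<string>(results);
    }

    public static void Main()
    {
        // Assumption: the real producer would enumerate the lines of urls.txt.
        string[] lines = { "http://a.example", "http://b.example", "http://c.example" };
        foreach (string result in Run(lines, 2))
        {
            Console.WriteLine(result);
        }
    }
}
```

Because the collection does its own synchronization, the batching workaround for lock contention is not needed in this version. Changing the number of worker threads is a matter of changing consumerCount.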
You could use Parallel.ForEach() to schedule work for each item in the list. That spreads the work across all available processors, assuming that parseEngine takes some time to run. If parseEngine runs pretty quickly (defined as less than 250 ms), increase the number of "on-demand" threads by calling ThreadPool.SetMinThreads(), which will result in more threads executing at once.
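For the ThreadPool.SetMinThreads() part, a small sketch of raising only the worker-thread minimum while leaving the I/O completion-port minimum as it was. The helper name RaiseMinWorkerThreads and the value 100 are assumptions for illustration. Whether a higher minimum actually helps depends on how much of Parse is spent waiting rather than computing.

```csharp
using System;
using System.Threading;

public static class MinThreadsSketch
{
    // Hypothetical helper: raises the pool's minimum worker threads to
    // desiredWorkers, never lowering it. Returns false if the pool rejects the value.
    public static bool RaiseMinWorkerThreads(int desiredWorkers)
    {
        int workers;
        int ioThreads;
        ThreadPool.GetMinThreads(out workers, out ioThreads);
        if (desiredWorkers <= workers)
        {
            return true; // the current minimum is already high enough
        }
        // Keep the I/O completion-port minimum unchanged.
        return ThreadPool.SetMinThreads(desiredWorkers, ioThreads);
    }

    public static void Main()
    {
        // 100 mirrors the thread count mentioned in the question.
        bool ok = RaiseMinWorkerThreads(100);
        Console.WriteLine(ok ? "minimum raised" : "request rejected");
    }
}
```

Call this once at startup, before Parallel.ForEach() begins, so the pool can create threads up to the new minimum on demand instead of adding them gradually.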