How to best parallelize parsing of web pages?
I am using the Html Agility Pack to parse individual pages of a forum website. The parsing method returns all the topic/thread links on the page whose link is passed in as an argument, and I gather these topic links from all the parsed pages into a single collection.

After that, I check whether they are in my Dictionary of already-viewed URLs; if they are not, I add them to a new list and the UI shows this list, which is basically the new topics/threads created since last time.

Since all these operations seem independent, what would be the best way to parallelize this? Should I use .NET 4.0's Parallel.For/ForEach?

Either way, how can I gather the results for each page into a single collection? Or is that not necessary? Can I read from my centralized Dictionary whenever a parse method finishes, to check whether the links are already there, even if several methods finish at the same time?

If I run this program for 4000 pages it takes about 90 minutes; it would be great if I could use all 8 of my cores and finish the same task in roughly 10 minutes.
2 Answers
Parallel.For/ForEach combined with a ConcurrentDictionary<TKey, TValue> to share state between the different threads seems like a good way to implement this. The concurrent dictionary ensures safe reads and writes from multiple threads.
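As a rough sketch of that suggestion (not code from the original answer), the following .NET 4.0 snippet fetches each page with WebClient, parses it with the Html Agility Pack, and uses a ConcurrentDictionary as a thread-safe set of already-seen topic links; names such as ParsePage, seenUrls and newTopics are illustrative:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Net;
using System.Threading.Tasks;
using HtmlAgilityPack;

class NewTopicScanner
{
    // Thread-safe "set" of topic URLs seen so far; the bool value is unused.
    static readonly ConcurrentDictionary<string, bool> seenUrls =
        new ConcurrentDictionary<string, bool>();

    // Topic links discovered in this run that were not seen before.
    static readonly ConcurrentBag<string> newTopics = new ConcurrentBag<string>();

    // Downloads one forum page and yields every link it contains.
    static IEnumerable<string> ParsePage(string pageUrl)
    {
        string html;
        using (var client = new WebClient())
        {
            html = client.DownloadString(pageUrl);
        }

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
            yield break;

        foreach (var anchor in anchors)
            yield return anchor.GetAttributeValue("href", string.Empty);
    }

    static void Main(string[] args)
    {
        // The forum page URLs to scan; passed on the command line here.
        string[] pageUrls = args;

        Parallel.ForEach(pageUrls, pageUrl =>
        {
            foreach (string topicUrl in ParsePage(pageUrl))
            {
                // TryAdd is atomic: it returns true for exactly one thread,
                // so each new topic lands in newTopics exactly once.
                if (seenUrls.TryAdd(topicUrl, true))
                    newTopics.Add(topicUrl);
            }
        });

        Console.WriteLine("{0} new topic links found.", newTopics.Count);
    }
}
```

Because Parallel.ForEach partitions the page URLs across worker threads, no separate per-page result collection is needed: every thread writes its newly found links straight into the shared concurrent collections.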
You can certainly use Parallel.For/ForEach to do that, but you should think a bit about the design of your crawler. Most crawlers dedicate several threads to crawling, and each thread is associated with a page-fetching client that is responsible for fetching the pages (in your case, probably using WebRequest/WebResponse). I would recommend reading the papers on crawler design. If you implement the Mercator design, you should easily be able to download 50 pages per second, so your 4000 pages would be downloaded in 80 seconds.

You can store your results in a ConcurrentDictionary<TKey, TValue>, as Darin mentioned. You don't need to store anything in the value, since your key would be the link/URL; however, if you are performing a URL-seen test, then you can hash each link/URL to an integer and store the hash as the key and the link/URL as the value. It's entirely up to you to decide what's necessary, but if you are performing a URL-seen test, then it is necessary.
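To make that URL-seen test concrete, here is a minimal sketch under those assumptions; MarkIfNew is a hypothetical helper, and GetHashCode merely stands in for whatever integer hash you would actually use (a real crawler would usually prefer a stronger 64-bit fingerprint to limit collisions):

```csharp
using System.Collections.Concurrent;

// Illustration only: a URL-seen test that keys the dictionary by an integer
// hash of the link and keeps the link itself as the value.
class UrlSeenTest
{
    private readonly ConcurrentDictionary<int, string> seen =
        new ConcurrentDictionary<int, string>();

    // Returns true the first time a URL is offered, false on every later attempt.
    public bool MarkIfNew(string url)
    {
        int key = url.GetHashCode(); // placeholder for a stronger fingerprint
        return seen.TryAdd(key, url);
    }
}
```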
Yes, ConcurrentDictionary allows multiple threads to read simultaneously, so that should be fine. It will work well if you just want to see whether a link has already been crawled.

If you design your crawler well enough, you should be able to download and parse (extract all the links from) 4000 pages in about 57 seconds on an average desktop PC... I get roughly those results with the standard C# WebRequest on a 4 GB, i5 3.2 GHz PC with a 10 Mbps connection.
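As an illustration of the "several dedicated crawl threads" layout described above (a much simplified stand-in, not the Mercator design itself), the sketch below runs a fixed pool of worker threads that pull page URLs from a shared queue and fetch them with WebRequest/WebResponse; the URLs, the thread count, and the FetchLoop/frontier names are assumptions made for the example:

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Net;
using System.Threading;

class FetcherSketch
{
    // Shared frontier of page URLs to fetch; ConcurrentQueue is safe for many consumers.
    static readonly ConcurrentQueue<string> frontier = new ConcurrentQueue<string>();

    // Fetch loop owned by one dedicated crawl thread.
    static void FetchLoop()
    {
        string url;
        while (frontier.TryDequeue(out url))
        {
            var request = (HttpWebRequest)WebRequest.Create(url);
            using (var response = (HttpWebResponse)request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream()))
            {
                string html = reader.ReadToEnd();
                // Hand the HTML off to the parsing/link-extraction stage here.
                Console.WriteLine("{0}: {1} characters", url, html.Length);
            }
        }
    }

    static void Main()
    {
        // Placeholder URLs; in the question these would be the 4000 forum pages.
        for (int i = 1; i <= 4000; i++)
            frontier.Enqueue("http://example.com/forum/page/" + i);

        var workers = new Thread[8]; // one fetch thread per core, as in the question
        for (int i = 0; i < workers.Length; i++)
        {
            workers[i] = new Thread(FetchLoop);
            workers[i].Start();
        }
        foreach (var worker in workers)
            worker.Join();
    }
}
```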