How to best parallelize parsing of web pages?
I am using the Html Agility Pack to parse individual pages of a forum website. The parsing method returns all the topic/thread links on the page whose link is passed in as an argument, and I gather these topic links from all the parsed pages into a single collection.

After that, I check whether they are in my Dictionary of already-viewed URLs; if they are not, I add them to a new list and the UI shows this list, which is basically the new topics/threads created since last time.

Since all these operations seem independent, what would be the best way to parallelize this? Should I use .NET 4.0's Parallel.For/ForEach?

Either way, how can I gather the results for each page into a single collection? Or is that not necessary? Can I read from my centralized Dictionary whenever a parse method finishes, to check whether the links are already there, even if several methods finish at the same time?

If I run this program for 4000 pages it takes about 90 minutes; it would be great if I could use all 8 of my cores and finish the same task in roughly 10 minutes.
2 Answers
Parallel.For/ForEach combined with a ConcurrentDictionary<TKey, TValue> to share state between the different threads seems like a good way to implement this. The concurrent dictionary ensures safe reads and writes from multiple threads.
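As a rough sketch of that suggestion (not code from the original answer), the following .NET 4.0 snippet fetches each page with WebClient, parses it with the Html Agility Pack, and uses a ConcurrentDictionary as a thread-safe set of already-seen topic links; names such as ParsePage, seenUrls and newTopics are illustrative:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Net;
using System.Threading.Tasks;
using HtmlAgilityPack;

class NewTopicScanner
{
    // Thread-safe "set" of topic URLs seen so far; the bool value is unused.
    static readonly ConcurrentDictionary<string, bool> seenUrls =
        new ConcurrentDictionary<string, bool>();

    // Topic links discovered in this run that were not seen before.
    static readonly ConcurrentBag<string> newTopics = new ConcurrentBag<string>();

    // Downloads one forum page and yields every link it contains.
    static IEnumerable<string> ParsePage(string pageUrl)
    {
        string html;
        using (var client = new WebClient())
        {
            html = client.DownloadString(pageUrl);
        }

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
            yield break;

        foreach (var anchor in anchors)
            yield return anchor.GetAttributeValue("href", string.Empty);
    }

    static void Main(string[] args)
    {
        // The forum page URLs to scan; passed on the command line here.
        string[] pageUrls = args;

        Parallel.ForEach(pageUrls, pageUrl =>
        {
            foreach (string topicUrl in ParsePage(pageUrl))
            {
                // TryAdd is atomic: it returns true for exactly one thread,
                // so each new topic lands in newTopics exactly once.
                if (seenUrls.TryAdd(topicUrl, true))
                    newTopics.Add(topicUrl);
            }
        });

        Console.WriteLine("{0} new topic links found.", newTopics.Count);
    }
}
```

Because Parallel.ForEach partitions the page URLs across worker threads, no separate per-page result collection is needed: every thread writes its newly found links straight into the shared concurrent collections.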
You can certainly use Parallel.For/ForEach to do that, but you should think a bit about the design of your crawler. Most crawlers dedicate several threads to crawling, and each thread is associated with a page-fetching client that is responsible for fetching the pages (in your case, probably using WebRequest/WebResponse). I would recommend reading the papers on crawler design. If you implement the Mercator design, you should easily be able to download 50 pages per second, so your 4000 pages would be downloaded in 80 seconds.

You can store your results in a ConcurrentDictionary<TKey, TValue>, as Darin mentioned. You don't need to store anything in the value, since your key would be the link/URL; however, if you are performing a URL-seen test, then you can hash each link/URL to an integer and store the hash as the key and the link/URL as the value. It's entirely up to you to decide what's necessary, but if you are performing a URL-seen test, then it is necessary.
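To make that URL-seen test concrete, here is a minimal sketch under those assumptions; MarkIfNew is a hypothetical helper, and GetHashCode merely stands in for whatever integer hash you would actually use (a real crawler would usually prefer a stronger 64-bit fingerprint to limit collisions):

```csharp
using System.Collections.Concurrent;

// Illustration only: a URL-seen test that keys the dictionary by an integer
// hash of the link and keeps the link itself as the value.
class UrlSeenTest
{
    private readonly ConcurrentDictionary<int, string> seen =
        new ConcurrentDictionary<int, string>();

    // Returns true the first time a URL is offered, false on every later attempt.
    public bool MarkIfNew(string url)
    {
        int key = url.GetHashCode(); // placeholder for a stronger fingerprint
        return seen.TryAdd(key, url);
    }
}
```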
Yes, ConcurrentDictionary allows multiple threads to read simultaneously, so that should be fine. It will work well if you just want to see whether a link has already been crawled.

If you design your crawler well enough, you should be able to download and parse (extract all the links from) 4000 pages in about 57 seconds on an average desktop PC... I get roughly those results with the standard C# WebRequest on a 4 GB, i5 3.2 GHz PC with a 10 Mbps connection.
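As an illustration of the "several dedicated crawl threads" layout described above (a much simplified stand-in, not the Mercator design itself), the sketch below runs a fixed pool of worker threads that pull page URLs from a shared queue and fetch them with WebRequest/WebResponse; the URLs, the thread count, and the FetchLoop/frontier names are assumptions made for the example:

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Net;
using System.Threading;

class FetcherSketch
{
    // Shared frontier of page URLs to fetch; ConcurrentQueue is safe for many consumers.
    static readonly ConcurrentQueue<string> frontier = new ConcurrentQueue<string>();

    // Fetch loop owned by one dedicated crawl thread.
    static void FetchLoop()
    {
        string url;
        while (frontier.TryDequeue(out url))
        {
            var request = (HttpWebRequest)WebRequest.Create(url);
            using (var response = (HttpWebResponse)request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream()))
            {
                string html = reader.ReadToEnd();
                // Hand the HTML off to the parsing/link-extraction stage here.
                Console.WriteLine("{0}: {1} characters", url, html.Length);
            }
        }
    }

    static void Main()
    {
        // Placeholder URLs; in the question these would be the 4000 forum pages.
        for (int i = 1; i <= 4000; i++)
            frontier.Enqueue("http://example.com/forum/page/" + i);

        var workers = new Thread[8]; // one fetch thread per core, as in the question
        for (int i = 0; i < workers.Length; i++)
        {
            workers[i] = new Thread(FetchLoop);
            workers[i].Start();
        }
        foreach (var worker in workers)
            worker.Join();
    }
}
```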