Architecture — how to efficiently crawl the web with 10,000 machines?

Let's pretend I have a network of 10,000 machines. I want to use all those machines to crawl the web as fast as possible. All pages should be downloaded only once. In addition, there must be no single point of failure, and we must minimize the amount of communication required between machines. How would you accomplish this?

Is there anything more efficient than using consistent hashing to distribute the load across all machines and minimize communication between them?
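
To make the consistent-hashing idea concrete, here is a minimal sketch (not a full crawler): each URL is hashed onto a ring, the first machine clockwise from that point owns it, and virtual nodes smooth out the load. The node names such as `machine-0` are made up for illustration, and every machine is assumed to know the full membership list.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Map URLs to crawler machines; adding or removing a machine only remaps ~1/N of the URLs."""

    def __init__(self, nodes, vnodes=100):
        # Give every physical node `vnodes` positions on the ring to even out load.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def owner(self, url: str) -> str:
        """The first node clockwise from the URL's hash owns that URL."""
        idx = bisect.bisect(self._hashes, self._hash(url)) % len(self._hashes)
        return self._ring[idx][1]

# Tiny demo with made-up node names; in practice this would be the 10,000 machines.
ring = ConsistentHashRing([f"machine-{i}" for i in range(16)])
print(ring.owner("http://example.com/index.html"))
```

Because every machine can compute `owner()` locally from the same membership list, no per-URL coordination is needed; communication is only required when machines join or leave the ring.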

Comments (2)

疯狂的代价 2024-12-20 09:47:21
  1. Use a distributed MapReduce system such as Hadoop to divide the work space.
  2. If you want to be clever, or are doing this in an academic context, try nonlinear dimensionality reduction.
  3. The simplest implementation would probably be to use a hash function on the namespace key, e.g. the domain name or URL. Use Chord to assign each machine a subset of the hash values to process (a rough sketch follows this list).
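
A rough sketch of point 3, under a couple of assumptions: URLs are keyed by their domain (so all pages from one host map to the same machine), node IDs live in a Chord-style identifier space, and the successor lookup is reduced to a scan over all known IDs rather than a real Chord finger table.

```python
import hashlib
from urllib.parse import urlparse

# Chord identifiers live in a 2**m space; m = 32 is an assumption for this sketch.
M = 32
RING_SIZE = 2 ** M

def chord_id(key):
    """Hash a key (machine name or domain) into the Chord identifier space."""
    return int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16) % RING_SIZE

def successor(key_id, node_ids):
    """Chord rule: a key is owned by the first node ID >= the key, wrapping around."""
    for nid in sorted(node_ids):
        if nid >= key_id:
            return nid
    return min(node_ids)  # wrap around the ring

# Hypothetical machines; in a real Chord deployment each node would only know
# O(log N) fingers, not the full list.
nodes = {chord_id(f"machine-{i}"): f"machine-{i}" for i in range(8)}

url = "http://example.com/a/b.html"
domain = urlparse(url).netloc  # hash the domain so one host always goes to one machine
owner = nodes[successor(chord_id(domain), list(nodes))]
print(owner)
```
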
梦与时光遇 2024-12-20 09:47:21

One idea would be to use work queues (directories or a DB), assuming you work out storage so that it meets your redundancy criteria.

\retrieve

\retrieve\server1

\retrieve\server...

\retrieve\server10000

\in-process

\complete

1.) All pages to be seeds are hashed and placed in the queue, using the hash as the file name (root).

2.) Before putting a page in the queue, check the complete and in-process queues to make sure you don't re-queue it.

3.) Each server retrieves a random batch (1–N) of files from the retrieve queue and attempts to move them into its private server queue (a sketch of steps 3–7 follows this list).

4.) Files that fail the rename are assumed to have been “claimed” by another process.

5.) Files that can be moved are processed; put a marker in the in-process directory to prevent re-queuing.

6.) Download the file and place it into the \complete queue.

7.) Clean the file out of the in-process and server directories.

8.) Every 1,000 runs check the oldest 10 in-process files by trying to move them from their server queues back into the general retrieve queue. This will help if a server hangs and also should load balance slow servers.
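
A sketch of the claim-and-process loop (steps 3–7) for one server, assuming the queue directories above already exist on shared storage. The directory names, batch size, and the download_page placeholder are illustrative; os.rename doubles as the atomic "claim", because it fails once another server has already moved the file.

```python
import os
import random
import shutil

# Illustrative directory names following the layout above; adjust to taste.
RETRIEVE = "queue/retrieve"
MY_QUEUE = "queue/retrieve/server1"      # this server's private queue
IN_PROCESS = "queue/in-process"
COMPLETE = "queue/complete"

def claim_batch(n):
    """Steps 3-4: grab up to n random work files; a failed rename means
    another server already claimed that file."""
    claimed = []
    candidates = [f for f in os.listdir(RETRIEVE)
                  if os.path.isfile(os.path.join(RETRIEVE, f))]
    for name in random.sample(candidates, min(n, len(candidates))):
        try:
            # Atomic on the same filesystem: whoever renames first owns the file.
            os.rename(os.path.join(RETRIEVE, name), os.path.join(MY_QUEUE, name))
            claimed.append(name)
        except OSError:
            pass                         # lost the race; someone else owns it
    return claimed

def process(name):
    """Steps 5-7: mark in-process, download, move to complete, clean up."""
    marker = os.path.join(IN_PROCESS, name)
    open(marker, "w").close()            # step 5: prevent re-queuing
    # step 6: download_page(name) would be the actual HTTP fetch (placeholder)
    shutil.move(os.path.join(MY_QUEUE, name), os.path.join(COMPLETE, name))
    os.remove(marker)                    # step 7: clean up the marker

for work_item in claim_batch(10):
    process(work_item)
```

In this scheme the atomic rename is the only coordination primitive, so there is no central coordinator to act as a single point of failure beyond the shared storage itself.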

For the retrieve, in-process, and complete queues, note that most file systems hate millions of files in one directory. If you were scaling to billions of downloads, divide storage into segments based on the leading characters of the hash: \abc\def\123\ would be the directory for file abcdef123FFFFFF….
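
For illustration, a tiny helper that builds that kind of hash-prefix directory path (SHA-1 and three 3-character levels are assumptions, matching the \abc\def\123\ example):

```python
import hashlib
import os

def shard_path(base, url):
    """A file whose hash starts abcdef123... lands under base/abc/def/123/."""
    h = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return os.path.join(base, h[:3], h[3:6], h[6:9], h)

print(shard_path("complete", "http://example.com/index.html"))
```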

If you use MongoDB instead of a regular file store, most of these problems are avoided and you can benefit from sharding, etc.
