Scaling a system that is constrained by the database?

Posted 2024-08-10 00:21:21 · 455 words · 7 views · 0 comments


Here is the scenario, followed by some proposed solutions.
Are there any better solutions?

There is a system A which has to "analyse" a large number of URLs.
Another system B generates these URLs; currently there are about 10 million of them in a database.
Sample schema:

id  URL      has_extracted
1   abc.com  0
2   bit.ly   1

My solutions are as follows:

Naive solution: Have a Perl script/process that reads the URLs from the database, feeds them to system A, and updates the has_extracted column.
The problem with this approach is that it does not scale well.

Solution 2: Split the URL table into five (or n) tables.
(I am planning to remove the has_extracted column because it seems to be a scalability bottleneck in this scenario.)

Solution 3:
Remove the has_extracted column.
Create another table that tracks the last URL processed by each process.

Critiques and alternative solutions are welcome. Thanks in advance.


迷爱 2024-08-17 00:21:21

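Why doesn't your naive solution scale well? If you use bulk updates and commit infrequently, you can update a million rows per second on any database without any tuning.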
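
As a rough illustration of bulk updates with infrequent commits, here is a sketch using Python's built-in sqlite3 module. The urls table and has_extracted column come from the question; the function names and the batch size are assumptions, and the sketch makes no claim about the throughput you will actually see.

```python
import sqlite3

def mark_extracted(conn, ids, batch_size=10_000):
    """Set has_extracted = 1 for the given ids, committing once per batch.

    Returns the number of commits issued.
    """
    batch = []
    commits = 0
    for url_id in ids:
        batch.append((url_id,))
        if len(batch) >= batch_size:
            _flush(conn, batch)
            commits += 1
            batch = []
    if batch:  # flush the final, partially filled batch
        _flush(conn, batch)
        commits += 1
    return commits

def _flush(conn, batch):
    # One executemany call and one commit per batch, not per row.
    conn.executemany("UPDATE urls SET has_extracted = 1 WHERE id = ?", batch)
    conn.commit()
```

The batch size is a tuning knob: larger batches mean fewer commits but more work to redo if a batch fails.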

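If you want to run multiple instances of system A, you can use a hash function to split the input data into groups, with each instance of system A consuming exactly one group.

If the number of instances of system A is constant (say, 17), you can use id % 17 as the hash function.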
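
The id % 17 idea can be sketched in a few lines of Python. The helper names below are hypothetical, and rows are assumed to be (id, url) tuples; in practice the same predicate would more likely go into each instance's SQL WHERE clause.

```python
NUM_INSTANCES = 17  # the constant instance count used in the answer's example

def owner_of(url_id, num_instances=NUM_INSTANCES):
    """Return which instance of system A is responsible for this id."""
    return url_id % num_instances

def rows_for_instance(rows, instance, num_instances=NUM_INSTANCES):
    """Keep only the (id, url) rows that belong to the given instance."""
    return [row for row in rows if owner_of(row[0], num_instances) == instance]
```

Because every id maps to exactly one instance, the groups never overlap and no coordination between instances is needed, as long as the instance count stays fixed.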


云巢 2024-08-17 00:21:21

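I think this could work as follows:

URL generator (one or more machines)
URL stack (one)
URL processors (many)

The URL generators produce URLs and push them all onto the stack, which could live in the database, in memory, or wherever you like.

Each URL processor asks the URL stack for the next URL to process. The URL stack hands it a URL and either marks that URL as handed out or deletes it. When a URL processor has finished with a URL, it asks the URL stack again, reporting that it is done with URL1 and wants URL2. The URL stack can then mark or remove URL1 in its list and hand out URL2.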
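
The hand-out protocol described above can be sketched as a small in-memory class. The class and method names are hypothetical, and a real deployment would keep this state in the database or a queue service rather than in one process.

```python
import threading

class UrlStack:
    """Hands out URLs to processors and tracks which ones are in progress."""

    def __init__(self, urls=()):
        self._pending = list(urls)
        self._in_progress = set()
        self._lock = threading.Lock()  # processors may call in from several threads

    def push(self, url):
        """Called by a URL generator to add work."""
        with self._lock:
            self._pending.append(url)

    def next_url(self, finished=None):
        """Report `finished` as done (if given) and receive the next URL, or None."""
        with self._lock:
            if finished is not None:
                self._in_progress.discard(finished)
            if not self._pending:
                return None
            url = self._pending.pop()   # last pushed, first handed out
            self._in_progress.add(url)  # "mark as handed out"
            return url

    def outstanding(self):
        """Number of URLs not yet reported as finished."""
        with self._lock:
            return len(self._pending) + len(self._in_progress)
```

A processor loop would call next_url() once to start and then next_url(finished=previous) after each URL, stopping when it receives None.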

If the URL stack becomes a bottleneck, you can cluster the database.
