Scaling a database-constrained system?
Following is the scenario and some proposed solutions.
Are there any better solutions?
There is a system A which has to "analyse" lots of URLs.
Another system B generates these URLs - currently there are about 10 million of them in a database.
Sample schema:
id  URL      has_extracted
1   abc.com  0
2   bit.ly   1
My solutions are as follows:
Naive solution: have a Perl script/process which feeds the URLs (from the database) to system A and updates the has_extracted column.
The problem with this approach is that it does not scale well.
Solution 2: split the database into five (or n) tables.
(I am planning to remove the has_extracted column, because it seems to be the scalability bottleneck in this scenario.)
Solution 3:
Remove the has_extracted column.
Create another table which maintains/tracks the last URL processed by each process.
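For illustration, here is a minimal sketch of what Solution 3 could look like, in Python with SQLite purely as stand-ins; the process_cursor table and all other names are assumptions, and how ids get divided between processes (by range, by hash, etc.) is left open here, just as it is in Solution 3 itself:

    import sqlite3

    conn = sqlite3.connect("urls.db")

    # Hypothetical tracking table replacing has_extracted: one row per
    # analyser process, remembering the last URL id that process handled.
    conn.execute("""CREATE TABLE IF NOT EXISTS process_cursor (
                        process_id INTEGER PRIMARY KEY,
                        last_id    INTEGER NOT NULL DEFAULT 0)""")

    def next_batch(process_id, batch_size=1000):
        """Resume this process where it left off; no per-URL UPDATE needed."""
        row = conn.execute("SELECT last_id FROM process_cursor WHERE process_id = ?",
                           (process_id,)).fetchone()
        last_id = row[0] if row else 0
        batch = conn.execute("SELECT id, url FROM urls WHERE id > ? ORDER BY id LIMIT ?",
                             (last_id, batch_size)).fetchall()
        if batch:
            # Advance the cursor once per batch instead of once per URL.
            conn.execute("INSERT OR REPLACE INTO process_cursor VALUES (?, ?)",
                         (process_id, batch[-1][0]))
            conn.commit()
        return batch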
Critiques/Proposed solutions requested. Thanks in advance.
Comments (3)
Why doesn't your naive solution scale well? If you're using bulk updates and commit infrequently, you can update a million rows per second on any database, without any tuning.
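To make "bulk updates and infrequent commits" concrete, here is a sketch in Python with SQLite (a Perl/DBI version would look much the same); ids_from_system_a is a made-up stand-in for however system A reports finished URLs:

    import sqlite3

    conn = sqlite3.connect("urls.db")
    BATCH_SIZE = 10000  # one commit per 10,000 rows instead of one per row

    def ids_from_system_a():
        # Placeholder for the real feed of finished URL ids.
        yield from range(1, 25001)

    pending = []
    for url_id in ids_from_system_a():
        pending.append((url_id,))
        if len(pending) >= BATCH_SIZE:
            conn.executemany("UPDATE urls SET has_extracted = 1 WHERE id = ?", pending)
            conn.commit()
            pending.clear()

    if pending:  # flush the final partial batch
        conn.executemany("UPDATE urls SET has_extracted = 1 WHERE id = ?", pending)
        conn.commit()

Each commit forces the database to sync to disk, so committing per batch rather than per row is where most of the throughput comes from.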
If you want to run multiple instances of system A, you can use a hash function to divide the input data into groups, where each instance of system A consumes exactly one group.
If you have a constant number of instances of system A, e.g. 17, you can use the function
id%17
as the hash function.
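Concretely, each instance would then only ever read its own residue class; a short sketch under the sample schema above, with all names assumed:

    import sqlite3

    N_INSTANCES = 17  # constant number of system A instances, as above

    def urls_for_instance(conn, instance_no, after_id=0, batch_size=1000):
        # Next batch of unprocessed URLs for one instance; partitioning on
        # id % N_INSTANCES means no two instances ever see the same URL.
        return conn.execute(
            "SELECT id, url FROM urls"
            " WHERE id > ? AND id % ? = ? AND has_extracted = 0"
            " ORDER BY id LIMIT ?",
            (after_id, N_INSTANCES, instance_no, batch_size)).fetchall()

    conn = sqlite3.connect("urls.db")
    batch = urls_for_instance(conn, instance_no=3)  # this is instance 3 of 17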
I think this could work as follows:
URL generator(s) generate URLs and push all of them onto a stack, say in the database, or in memory, or wherever you want.
URL processors consult the URL stack to get the next URL to process. The URL stack hands them a URL and marks it as given out, or deletes it. When a URL processor finishes processing a URL, it consults the URL stack again, saying that it has finished URL1 and wants to process URL2. The URL stack can then mark/delete URL1 from its list and hand out URL2.
If the URL stack becomes a bottleneck, you can just cluster the database.
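A rough sketch of that URL-stack idea as a single queue table, again in Python/SQLite with made-up names; note it relies on UPDATE ... RETURNING, which needs SQLite 3.35 or newer:

    import sqlite3

    # Autocommit mode: each UPDATE below runs as its own atomic transaction.
    conn = sqlite3.connect("urls.db", isolation_level=None)

    # Hypothetical queue table: status 0 = pending, 1 = given out, 2 = done.
    conn.execute("""CREATE TABLE IF NOT EXISTS url_stack (
                        id     INTEGER PRIMARY KEY,
                        url    TEXT NOT NULL,
                        status INTEGER NOT NULL DEFAULT 0)""")

    def take_url():
        """Claim one pending URL; SQLite serializes writers, so two
        processors can never be handed the same URL."""
        return conn.execute(
            """UPDATE url_stack SET status = 1
               WHERE id = (SELECT id FROM url_stack WHERE status = 0 LIMIT 1)
               RETURNING id, url""").fetchone()  # None once the stack is empty

    def finish_url(url_id):
        """Report a URL as done; mark it (or DELETE it) as described above."""
        conn.execute("UPDATE url_stack SET status = 2 WHERE id = ?", (url_id,))

If this one table becomes the bottleneck, "cluster the database" in practice means moving it to a database or queue system built for many concurrent writers.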
I somehow feel my problem is similar to the one posted at this link (an extract is provided below). The solutions in the aforementioned link and this link, "Databases suck for messaging", have given me a better direction for implementing a solution.
Extract: So you want to build a system that does jobs. You want the jobs to be able to run in parallel for speed, but also for redundancy. This system needs to be coordinated so that, for example, the same jobs aren't done twice, the status of every job is easy to see, and multiple servers can run jobs by simply querying the central source.