Hierarchy of 50 million nodes or more
Does anyone out there have any great ideas for building a massively scalable hierarchical datastore? It needs rapid inserts, and many users of the site must be able to request reports on the number of nodes below a given node in the hierarchy.
This is the scenario:
I will have a very large number of nodes added per hour. Let's say I want to add 1 million nodes per hour. They will likely appear all over the hierarchy. Ideally the scale will reach billions of nodes, but 50 million is the target to aim for. I need to be able to calculate, at any time, the number of nodes below any given point, and there will likely be many people doing this at the same time. Think of it as a report that many users (perhaps 100,000 concurrently) will be calling for at any one time. They might also request all nodes below a certain node.
The database could either be built by a single process reading from a flat table formatted as an adjacency list (rapid inserts, slow reporting), or it could be a standard design in which users of the website update the hierarchy directly, if a datastore exists that can cope with the massive number of nodes being created.
I already have this implemented in Django using Treebeard and MySQL. I am using the Materialised Path method, and it is fairly good, but I want lightning speed by comparison. With a datastore of 30,000 nodes I get 120 inserts per minute at the bottom of the tree, running on a two-year-old laptop. Obviously I want a lot more than this, and I suspect there may be a better datastore to use. Maybe PyTables, BigTable, MongoDB or Cassandra?
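To make the report concrete, here is a minimal sketch of how a "count everything below this node" query can work with a materialised path. It is not Treebeard's actual schema or API: the `node` table, the four-character path segments, and the `count_descendants` helper are all assumptions for illustration, and it uses an in-memory SQLite database rather than MySQL.

```python
import sqlite3

# Hypothetical schema: each node stores its full path from the root,
# built from fixed-width segments drawn from the alphabet 0-9A-Z.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node (path TEXT PRIMARY KEY)")

paths = [
    "0001",          # root A
    "00010001",      # child of A
    "00010002",      # child of A
    "000100010001",  # grandchild of A
    "0002",          # root B
]
conn.executemany("INSERT INTO node (path) VALUES (?)", [(p,) for p in paths])


def count_descendants(conn, path):
    """Count the nodes strictly below the node with the given path.

    Every descendant's path starts with `path` and is longer, so under
    SQLite's default binary text ordering it sorts after `path` and
    before `path + "~"` ("~" sorts after every character in 0-9A-Z).
    The primary-key index can serve this as a range scan.
    """
    row = conn.execute(
        "SELECT COUNT(*) FROM node WHERE path > ? AND path < ?",
        (path, path + "~"),
    ).fetchone()
    return row[0]


print(count_descendants(conn, "0001"))
print(count_descendants(conn, "0002"))
```

The count is a single index range scan, but its cost still grows with the size of the subtree, which is the part that worries me at 50 million nodes and many concurrent reports.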
Easy integration with Python/Django would be good, but I can always write this part of the system in another language if I have to. If a single process reads from the flat datastore and builds a really efficient hierarchical datastore tuned for reporting, I guess there would be no concurrency issues, which would remove the need for transactions.
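A rough sketch of that single-process idea, assuming the flat table can be read as `(node_id, parent_id)` rows with `parent_id` set to `None` for roots. The `subtree_counts` function and the row format are hypothetical. It precomputes the number of descendants for every node in one pass, so each report becomes a dictionary or key-value lookup.

```python
from collections import defaultdict


def subtree_counts(rows):
    """Return {node_id: number of descendants} for an adjacency list.

    `rows` is an iterable of (node_id, parent_id) pairs; parent_id is
    None for root nodes.
    """
    children = defaultdict(list)
    roots = []
    for node, parent in rows:
        if parent is None:
            roots.append(node)
        else:
            children[parent].append(node)

    counts = {}
    # Iterative post-order traversal, so very deep trees do not hit
    # Python's recursion limit.
    stack = [(root, False) for root in roots]
    while stack:
        node, children_done = stack.pop()
        kids = children.get(node, ())
        if children_done:
            # Each child contributes itself plus its own descendants.
            counts[node] = sum(counts[kid] + 1 for kid in kids)
        else:
            stack.append((node, True))
            stack.extend((kid, False) for kid in kids)
    return counts


rows = [(1, None), (2, 1), (3, 1), (4, 2), (5, 2), (6, 5)]
counts = subtree_counts(rows)
print(counts[1], counts[2], counts[5])
```

The trade-off is that the counts are only as fresh as the last batch run, and a full rebuild touches every node, so at a million inserts per hour it would probably need to be incremental (adding 1 to each ancestor of a new node) rather than a full recomputation.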
Anyway, that's enough info to get us started. Is this easy using the right technology?
Comments (1)
Have you looked at the Neo4J graph database? It seems pretty darn capable, and it has a Python wrapper and some support (in development) for Django. Neo runs on Java, and you can use it from Python either through Jython or through JPype with CPython.