5000 万个节点层次结构或更多

发布于 2024-09-15 22:06:52 字数 766 浏览 7 评论 0原文

有没有人有任何好主意来实现大规模可扩展的分层数据存储？它需要快速添加并能够让站点的许多用户请求有关层次结构中某个节点以下的节点数量的报告。

这就是场景......

我每小时都会添加大量节点。假设我想每小时添加 100 万个节点。他们可能会出现在整个层次结构中。理想情况下，规模将达到数十亿个节点，但目标是 5000 万个。我需要能够随时计算任何给定点以下的节点数量，并且可能会有很多人同时这样做。将其视为许多用户（可能有 100,000 个并发用户）随时都会调用的一份报告。他们可能会请求某个节点下面的所有节点。

数据库可以通过从格式化为邻接列表的平面表中读取的单个进程来创建（快速插入，慢速报告），也可以是标准设计，其中如果数据存储存在，网站用户将直接更新层次结构以应对正在创建的大量节点。

我已经使用 Treebeard 和 MySQL 在 Django 中实现了此功能。我正在使用物化路径方法，它相当不错，但相比之下我想要闪电般的速度。对于包含 30,000 个节点的数据存储，我在一台 2 年旧笔记本电脑上运行时每分钟在树底部实现 120 次插入。显然我想要的不仅仅是这个，并且认为也许有更好的数据存储可以使用。也许是 PyTables、BigTable、MongoDB 或 Cassandra？

轻松集成到 Python/Django 会很好，但如果需要的话，我总是可以用另一种语言编写系统的这一部分。如果我们使用从平面数据存储中读取的单个进程，并将其处理成一个真正高效的分层数据存储，这将非常适合报告，我想我将不会出现并发问题，从而消除对事务的需要。

无论如何，这些信息足以让我们开始。使用正确的技术这容易吗？

原文

Does anyone out there have any great ideas to achieve a massively scalable hierarchical datastore? It needs rapid add and ability to have many users of site requesting reports on the number of nodes below a certain node in hierarchy.

This is the scenario....

I will have a very large number of nodes getting added per hour. Lets say I want to add 1 million nodes per hour. They will likely be appearing all over the hierarchy. Ideally the scale will be into the billions of nodes but 50 million is a target to aim for. I need to be able to calculate at any time the number of nodes below any given point and there will likely be many people doign this at the same time. Think of it as a report that many users (100,000 concurrent perhaps) will be calling for at any one time. they might request all nodes below a certain node.

The database could either be created by a single process reading out of a flat table formatted as an adjacency list (rapid inserts, slow reporting) or it could be a standard design where users of the web site are updating the hierarchy directly if the datastore exists to cope with the massive number of nodes being created.

I already have this implemented in Django using Treebeard and MySQL. I am using a Materialised Path method and it is fairly good but I want lightning speed in comparison. With a datastore of 30,000 nodes I am achieving 120 inserts at the bottom of the tree per minute running on a 2 year old laptop. I want a lot more than this obviously and think that maybe there is a better datastore to use. Maybe PyTables, BigTable, MongoDB or Cassandra?

Easy integration into Python/Django would be good but I can always write this part of the system in another language if I have to. If we used the single process read out of flat datastore and process into a really efficient hierarchical datastore which will be perfect for reporting, I guess I will have no concurrency issues that will negate the need for transactions.

Anyway, that's enough info to get us started. Is this easy using the right technology?

分享到QQ

分享到微博