Data model for a Property Graph on HBase/Cassandra

Posted on 2024-10-07 11:38:04

I want to store a Property Graph in HBase. A Property Graph is a graph in which both nodes and edges have properties, and multiple edges can link the same pair of nodes as long as the edges belong to different types.

My query pattern will be either asking for properties and neighborhood or traversing the graph. An example is: Vertex[name=claudio]=>OutgoingEdge[knows]=>Vertex[gender=female], which will give me all the female people that claudio knows.

I know that graph databases do just this, but they usually don't scale across multiple nodes when the dataset is huge. So I want to implement this on a NoSQL column store (HBase, Cassandra, ...).

My data model follows.

Vertices Table:
key: vertexid (uuid)
Family "Properties:": <property name>=><property value>, ...
Family "OutgoingEdges:": <edge key>=><other vertexid>, ...
Family "IncomingEdges:": same as outgoing edges...

This table allows me to quickly fetch the properties of a vertex and
its adjacency list. I can't use the other endpoint's vertexid as the
column name because multiple edges (with different types) can connect
the same two vertices, so the columns are keyed by edge key instead.

Edges Table:
key: edge key (composite(<source vertexid>, <destination vertexid>,
<edge typename>)) (e.g. vertexid1_vertexid2_knows)
Family "Properties:": <property name>=><property value>, ...

This table allows me to quickly fetch the properties of an edge.
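
A minimal sketch of how such a composite edge key could be built and parsed. The helper names `make_edge_key` and `parse_edge_key` are hypothetical, and the `_` separator is simply taken from the example key above. It works here only because the textual form of a UUID contains hyphens but no underscores; a different separator or a fixed-width binary encoding may be a better choice in practice.

```python
import uuid

SEP = "_"  # assumption: separator copied from the example vertexid1_vertexid2_knows


def make_edge_key(src: uuid.UUID, dst: uuid.UUID, edge_type: str) -> str:
    """Build the composite row key <source>_<destination>_<edge type>."""
    return SEP.join((str(src), str(dst), edge_type))


def parse_edge_key(key: str) -> tuple[uuid.UUID, uuid.UUID, str]:
    """Recover the three components of a composite edge key."""
    # Split only twice so an edge type such as "works_with" survives intact.
    src, dst, edge_type = key.split(SEP, 2)
    return uuid.UUID(src), uuid.UUID(dst), edge_type


a, b = uuid.UUID(int=1), uuid.UUID(int=2)
key = make_edge_key(a, b, "knows")
print(key)
print(parse_edge_key(key) == (a, b, "knows"))
```

Because the edge type is part of the key, two edges of different types between the same two vertices get distinct rows.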

Edge Types Table:
key: composite(<source vertexid>, "out|in", <edge typename>) (e.g.
vertexid1_out_knows)
Family "Neighbor:": <destination vertexid>=>null, ...

This table allows me to scan for the edges that are incoming to or
outgoing from a vertex and belong to a specific type. It would be the
core of the traversal ability of the API, so I want it to be as fast as
possible in terms of both network I/O (RPCs) and disk I/O (seeks). It
should also scale with the size of the graph: as the graph grows, the
cost of this type of operation should depend on the number of edges
outgoing from the vertex and not on the total number of vertices and
edges.

For the example above, taking vertexid1 as the source vertex with
property name:claudio, I'd scan vertexid1_out_knows and receive the list
of connected vertices. After that I can read the column
"Properties:gender" on these vertices and look for those having the
"female" value.
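
The two-step lookup described above can be sketched with plain dictionaries standing in for the Vertices and Edge Types tables. Nothing here is an HBase or Cassandra client call; the ids `v1`..`v4`, the sample data, and the function names `neighbors` and `traverse` are made up for illustration.

```python
# In-memory stand-ins for the two tables (assumption: plain dicts, not a real client).
vertices = {
    "v1": {"Properties": {"name": "claudio", "gender": "male"}},
    "v2": {"Properties": {"name": "anna", "gender": "female"}},
    "v3": {"Properties": {"name": "marco", "gender": "male"}},
    "v4": {"Properties": {"name": "giulia", "gender": "female"}},
}
edge_types = {
    "v1_out_knows": {"Neighbor": {"v2": None, "v3": None}},
    "v1_out_likes": {"Neighbor": {"v4": None}},
    "v2_in_knows": {"Neighbor": {"v1": None}},
    "v3_in_knows": {"Neighbor": {"v1": None}},
    "v4_in_likes": {"Neighbor": {"v1": None}},
}


def neighbors(vertex_id, direction, edge_type):
    """One row read on the Edge Types table; its size grows with the vertex degree."""
    row = edge_types.get(f"{vertex_id}_{direction}_{edge_type}", {})
    return list(row.get("Neighbor", {}))


def traverse(source_id, edge_type, prop, value):
    """Vertex[source] => OutgoingEdge[edge_type] => Vertex[prop=value]."""
    hits = []
    for neighbor_id in neighbors(source_id, "out", edge_type):
        # One read of the single column "Properties:<prop>" per neighbor.
        if vertices[neighbor_id]["Properties"].get(prop) == value:
            hits.append(neighbor_id)
    return hits


print(traverse("v1", "knows", "gender", "female"))  # ['v2']
```

In this sketch the work done is one row read plus one column read per neighbor, so it depends on the out-degree of the source vertex rather than on the size of the graph. Against a real cluster the per-neighbor reads would presumably be batched into a multi-get to keep the number of RPCs down.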

Questions:

1) General: do you see a better data model for my operations?
2) Can I fit everything in one table, where for certain keys some
families would be empty (e.g. the "OutgoingEdges:" family would not make
sense for the edges)? I'd like that because, as you can see, all the
keys start with the vertexid uuid as a prefix, so the rows for one
vertex would sort next to each other and mostly land on the same
region server.
3) I guess that for the scanning I'd make extensive use of Filters, and
that the regexp Filter will be my friend. Do you have concerns about
the performance of filters applied to this data model?
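
To make questions 2 and 3 concrete, here is a rough simulation, assuming a single table whose row keys are kept in sorted order (as HBase does). It compares a scan bounded by a key prefix with an unbounded scan that applies a regular expression to every row. The key layout `v042_out_knows` and the function names are hypothetical, and the sorted list only imitates how a store would seek to a start row.

```python
import bisect
import re

# Sorted row keys of a hypothetical single table: 100 vertices x 2 directions x 2 types.
rows = sorted(
    f"v{v:03d}_{direction}_{edge_type}"
    for v in range(100)
    for direction in ("in", "out")
    for edge_type in ("knows", "likes")
)


def prefix_scan(prefix):
    """Seek to the first key >= prefix and stop at the first key that lacks it."""
    start = bisect.bisect_left(rows, prefix)
    matched, examined = [], 0
    for key in rows[start:]:
        examined += 1
        if not key.startswith(prefix):
            break
        matched.append(key)
    return matched, examined


def regex_filter_scan(pattern):
    """Unbounded scan: the filter is evaluated against every row in the table."""
    rx = re.compile(pattern)
    matched = [key for key in rows if rx.fullmatch(key)]
    return matched, len(rows)


print(prefix_scan("v042_out_"))           # (['v042_out_knows', 'v042_out_likes'], 3)
print(regex_filter_scan(r"v042_out_.*"))  # (['v042_out_knows', 'v042_out_likes'], 400)
```

Both scans return the same rows, but the bounded one examines 3 keys and the filtered one examines all 400. As far as I understand, a row filter on a scan is applied to the rows inside the scan range and does not narrow that range by itself, so designing keys for start/stop-row (prefix) scans is likely to matter more than the choice of filter.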

Comments (1)

听风念你 2024-10-14 11:38:04

This type of model looks like a sensible starting point for Cassandra (I don't know much about HBase) - but with any distributed store you will run up against problems when traversing, because traversals will cross multiple nodes.

This is why dedicated graph databases such as Neo4j use a single-node design, and try to keep all data in RAM.

Looking up properties of particular nodes or edges should work well and scale horizontally - Twitter's FlockDB (now apparently abandoned) was a notable example of this.

You also need to consider whether you need lookups other than by ID (i.e. do you need any indexes)?
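
For instance, the starting step Vertex[name=claudio] in the question is a lookup by property value, not by ID. One common way to support it on a key-value or column store is a hand-maintained index table. The sketch below uses dictionaries as stand-ins; the index key layout `<property name>=<value>` and the helper names `put_vertex` and `lookup` are assumptions for illustration, not part of the model in the question.

```python
from collections import defaultdict

# Hypothetical index table: row key "<property name>=<value>", columns are vertex ids.
property_index = defaultdict(set)
vertices = {}


def put_vertex(vertex_id, properties, indexed=("name",)):
    """Write the vertex row and keep the index rows for the chosen properties in sync."""
    old = vertices.get(vertex_id, {})
    for prop in indexed:
        if prop in old:
            # Drop the stale index entry before writing the new one.
            property_index[f"{prop}={old[prop]}"].discard(vertex_id)
        if prop in properties:
            property_index[f"{prop}={properties[prop]}"].add(vertex_id)
    vertices[vertex_id] = dict(properties)


def lookup(prop, value):
    """Return the ids of all vertices whose indexed property has the given value."""
    return sorted(property_index.get(f"{prop}={value}", ()))


put_vertex("v1", {"name": "claudio", "gender": "male"})
put_vertex("v2", {"name": "anna", "gender": "female"})
print(lookup("name", "claudio"))  # ['v1']
```

In a real store the vertex write and the index write touch different rows, so they are generally not atomic together, and the index can briefly disagree with the data; whether that is acceptable depends on the application.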
