
Data model for a Property Graph on HBase/Cassandra

Posted on 2024-10-07 11:38:04


I want to store a Property Graph in HBase. A Property Graph is a graph in which both nodes and edges have properties, and multiple edges can link the same pair of nodes as long as the edges are of different types.

My query pattern will be either asking for properties and neighborhoods or traversing the graph. An example is: Vertex[name=claudio]=>OutgoingEdge[knows]=>Vertex[gender=female], which will give me all the women that claudio knows.

I know that a graph database does just this, but graph databases usually don't scale across multiple nodes when the dataset is huge. So I want to implement this on a NoSQL column store (HBase, Cassandra...).

My data model follows.

Vertices Table:
key: vertexid (uuid)
Family "Properties:": <property name>=><property value>, ...
Family "OutgoingEdges:": <edge key>=><other vertexid>, ...
Family "IncomingEdges:": same as outgoing edges...

This table allows me to quickly fetch the properties of a vertex and
its adjacency list. I can't use the other endpoint's vertexid as the
column qualifier, because multiple edges (with different types) can
connect the same two vertices, so the qualifier is the edge key.
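
To illustrate why the adjacency columns are keyed by edge key, here is a minimal sketch that models one row of the vertices table as nested Python dictionaries (row key, then column family, then column qualifier). The dictionaries and the helper `add_outgoing_edge` are hypothetical stand-ins for illustration, not HBase API calls.

```python
# Hypothetical in-memory stand-in for one row of the vertices table:
# row key -> column family -> column qualifier -> value.
vertices = {
    "vertexid1": {
        "Properties": {"name": "claudio"},
        "OutgoingEdges": {},
        "IncomingEdges": {},
    }
}


def add_outgoing_edge(table, source, destination, edge_type):
    """Record an outgoing edge under its composite edge key."""
    edge_key = f"{source}_{destination}_{edge_type}"
    table[source]["OutgoingEdges"][edge_key] = destination
    return edge_key


add_outgoing_edge(vertices, "vertexid1", "vertexid2", "knows")
add_outgoing_edge(vertices, "vertexid1", "vertexid2", "likes")

# Both edges survive because the qualifiers differ; keyed by the
# destination vertexid alone, the second write would replace the first.
outgoing = vertices["vertexid1"]["OutgoingEdges"]
```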

Edges Table:
key: edge key (composite(<source vertexid>, <destination vertexid>,
<edge typename>)) (e.g. vertexid1_vertexid2_knows)
Family "Properties:": <property name>=><property value>, ...

This table allows me to quickly fetch the properties of an edge.
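
A minimal sketch of how such a composite edge key could be built and split again. The underscore separator follows the example key above; the function names `make_edge_key` and `parse_edge_key` are hypothetical, and the sketch assumes vertex ids are UUID strings, which contain hyphens but no underscores.

```python
import uuid

SEPARATOR = "_"  # assumed separator; UUID strings contain "-" but never "_"


def make_edge_key(source_id, destination_id, edge_type):
    """Build the composite row key <source>_<destination>_<type>."""
    return SEPARATOR.join([str(source_id), str(destination_id), edge_type])


def parse_edge_key(edge_key):
    """Split a row key back into (source, destination, edge type).

    maxsplit=2 keeps any underscores inside the edge type name intact.
    """
    source_id, destination_id, edge_type = edge_key.split(SEPARATOR, 2)
    return source_id, destination_id, edge_type


source = uuid.uuid4()
destination = uuid.uuid4()
key = make_edge_key(source, destination, "works_with")
```

Putting the edge type last means it may itself contain the separator without making the key ambiguous.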

Edge Types Table:
key: composite(<source vertexid>, "out|in", <edge typename>) (e.g.
vertexid1_out_knows)
Family "Neighbor:": <destination vertexid>=>null, ...

This table allows me to search/scan for edges that are either incoming
to or outgoing from a vertex and belong to a specific type. It would be
the core of the traversal ability of the API, so I want it to be as fast
as possible in terms of both network I/O (RPCs) and disk I/O (seeks). It
should also "scale" with the size of the graph, meaning that as the
graph grows, the cost of this type of operation should depend on the
number of edges outgoing from the vertex and not on the total number of
vertices and edges.
For the example above, taking vertexid1 as the source vertex with
property name:claudio, I'd scan vertexid1_out_knows and receive the list
of connected vertices. After that I can scan the column
"Properties:gender" on these vertices and look for those having the
value "female".
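
The traversal just described can be sketched against in-memory stand-ins for the tables. Everything here is an assumption for illustration: the sample vertices and edges are made up, and the functions `neighbors` and `traverse` are hypothetical helpers rather than HBase calls.

```python
# Hypothetical in-memory stand-ins for two of the tables above:
# row key -> column family -> column qualifier -> value.
vertices = {
    "vertexid1": {"Properties": {"name": "claudio", "gender": "male"}},
    "vertexid2": {"Properties": {"name": "anna", "gender": "female"}},
    "vertexid3": {"Properties": {"name": "marco", "gender": "male"}},
    "vertexid4": {"Properties": {"name": "laura", "gender": "female"}},
}
edge_types = {
    "vertexid1_out_knows": {"Neighbor": {"vertexid2": None, "vertexid3": None}},
    "vertexid1_out_likes": {"Neighbor": {"vertexid4": None}},
    "vertexid2_in_knows": {"Neighbor": {"vertexid1": None}},
    "vertexid3_in_knows": {"Neighbor": {"vertexid1": None}},
    "vertexid4_in_likes": {"Neighbor": {"vertexid1": None}},
}


def neighbors(source, direction, edge_type):
    """One row read: the qualifiers of the Neighbor family are the neighbors."""
    row = edge_types.get(f"{source}_{direction}_{edge_type}", {})
    return sorted(row.get("Neighbor", {}))


def traverse(source, edge_type, prop, value):
    """Follow outgoing edges of one type, keep neighbors whose property matches."""
    return [
        vertex_id
        for vertex_id in neighbors(source, "out", edge_type)
        if vertices[vertex_id]["Properties"].get(prop) == value
    ]


result = traverse("vertexid1", "knows", "gender", "female")
```

In this sketch the work done is one row read plus one property lookup per neighbor, so it grows with the number of "knows" edges leaving vertexid1 and not with the size of either table.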

Questions:

1) General: do you see a better data model for my operations?
2) Can I fit everything in one table, where for certain keys some
families would be empty (e.g. the "OutgoingEdges:" family would not make
sense for the edges)? I'd like that because, as you can see, all the keys
start with the vertexid uuid, so they would be very compact
and fit mostly on the same regionserver.
3) I guess that for the scanning I'd make extensive use of Filters, and
that the regexp Filter will be my friend. Do you have concerns about
the performance of filters applied to this data model?
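
To make the concern in question 3 concrete, here is a rough sketch comparing a key-range scan bounded by a prefix with a regular-expression test applied to every key. It only simulates a sorted key space with a Python list; the sample keys and the functions `prefix_scan` and `regex_scan` are hypothetical, and the sketch assumes a filter on an unbounded scan is evaluated against every row.

```python
import bisect
import re

# Hypothetical sorted row keys, standing in for a table ordered by key.
row_keys = sorted(
    [
        "vertexid1_in_knows",
        "vertexid1_out_knows",
        "vertexid1_out_likes",
        "vertexid2_in_knows",
        "vertexid2_out_knows",
        "vertexid3_in_likes",
    ]
)


def prefix_scan(keys, prefix):
    """Binary-search to the first key with the prefix, stop at the first miss."""
    start = bisect.bisect_left(keys, prefix)
    matches, examined = [], 0
    for key in keys[start:]:
        examined += 1
        if not key.startswith(prefix):
            break
        matches.append(key)
    return matches, examined


def regex_scan(keys, pattern):
    """Test every key against the pattern, as a filter over a full scan would."""
    compiled = re.compile(pattern)
    matches = [key for key in keys if compiled.match(key)]
    return matches, len(keys)


by_prefix, prefix_examined = prefix_scan(row_keys, "vertexid1_out_")
by_regex, regex_examined = regex_scan(row_keys, r"vertexid1_out_.*")
```

Both calls return the same two keys, but the prefix scan looks at three keys while the regex version looks at all six, and that gap would widen as the table grows.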
