Data model for a property graph on HBase/Cassandra
I'd like to store a property graph in HBase. A property graph is a graph in which both nodes and edges have properties, and multiple edges can link the same pair of nodes as long as the edges are of different types.
My query pattern will be either asking for properties and neighborhoods or traversing the graph. An example is: Vertex[name=claudio]=>OutgoingEdge[knows]=>Vertex[gender=female], which gives me all the women that claudio knows.
I know that graph databases do exactly this, but they usually don't scale across multiple nodes when the dataset is huge. So I'd like to implement this on a NoSQL column store (HBase, Cassandra...).
My data model follows.
Vertices Table:
key: vertexid (uuid)
Family "Properties:": <property name>=><property value>, ...
Family "OutgoingEdges:": <edge key>=><other vertexid>, ...
Family "IncomingEdges:": same as outgoing edges...
This table lets me quickly fetch the properties of a vertex and its
adjacency list. I can't use the other endpoint's vertexid as the column
name, because multiple edges (with different types) can connect the same
two vertices.
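To make the row layout concrete, here is a minimal in-memory sketch of the Vertices table. It uses plain Python dicts rather than the HBase or Cassandra client APIs, and the helper names (`add_vertex`, `add_edge`) are made up for illustration only.

```python
import uuid

# Hypothetical stand-in for the Vertices table: row key -> family -> column -> value.
vertices = {}

def add_vertex(props):
    """Create a vertex row with the three families described above."""
    vid = str(uuid.uuid4())
    vertices[vid] = {
        "Properties": dict(props),
        "OutgoingEdges": {},
        "IncomingEdges": {},
    }
    return vid

def add_edge(src, dst, edge_type):
    """Record an edge in the adjacency families of both endpoints."""
    # The column name is the full edge key, so two edges of different
    # types between the same two vertices do not overwrite each other.
    edge_key = f"{src}_{dst}_{edge_type}"
    vertices[src]["OutgoingEdges"][edge_key] = dst
    vertices[dst]["IncomingEdges"][edge_key] = src
    return edge_key

claudio = add_vertex({"name": "claudio"})
anna = add_vertex({"name": "anna", "gender": "female"})
add_edge(claudio, anna, "knows")
add_edge(claudio, anna, "likes")
```

With the other vertexid as the column name, the second `add_edge` call would have replaced the first entry; with the edge key as the column name both edges are kept.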
Edges Table:
key: edge key (composite(<source vertexid>, <destination vertexid>,
<edge typename>)) (e.g. vertexid1_vertexid2_knows)
Family "Properties:": <property name>=><property value>, ...
This table lets me quickly fetch the properties of an edge.
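A small sketch of how the composite edge key could be built and parsed. It assumes the "_" separator from the example key and assumes that vertex ids never contain "_" (true for canonical UUID strings); the function names are illustrative, not part of any library.

```python
SEP = "_"

def make_edge_key(src, dst, edge_type):
    """Build <source vertexid>_<destination vertexid>_<edge typename>."""
    return SEP.join((src, dst, edge_type))

def parse_edge_key(key):
    """Split an edge key back into its three parts.

    maxsplit=2 keeps any '_' inside the type name intact, which only
    works under the assumption that vertex ids contain no '_'.
    """
    src, dst, edge_type = key.split(SEP, 2)
    return src, dst, edge_type

# Hypothetical stand-in for the Edges table: edge key -> "Properties" family.
edges = {}

def put_edge(src, dst, edge_type, **props):
    key = make_edge_key(src, dst, edge_type)
    edges[key] = {"Properties": dict(props)}
    return key

key = put_edge("vertexid1", "vertexid2", "knows", since="2009")
```

Because the source vertexid comes first, all edges leaving one vertex sort next to each other in a store that orders rows by key.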
Edge Types Table:
key: composite(<source vertexid>, "out|in", <edge typename>) (e.g.
vertexid1_out_knows)
Family "Neighbor:": <destination vertexid>=>null, ...
This table lets me search/scan for the edges that are incoming to or
outgoing from a vertex and belong to a specific type. It would be the
core of the API's traversal ability, so I want it to be as fast as
possible in terms of both network I/O (RPCs) and disk I/O (seeks). It
should also "scale" with the size of the graph: as the graph grows, the
cost of this operation should depend on the number of edges outgoing
from the vertex, not on the total number of vertices and edges.
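The scaling property asked for here comes from keeping row keys sorted. The sketch below imitates that with a sorted list and `bisect`; `SortedTable` is a hypothetical toy class, not a client API, and short ids such as "v1" stand in for UUIDs.

```python
from bisect import bisect_left, insort

class SortedTable:
    """Toy model of a table whose rows are ordered by row key."""

    def __init__(self):
        self._keys = []   # row keys, kept sorted
        self._rows = {}   # row key -> {column: value}

    def put(self, row_key, column, value=None):
        if row_key not in self._rows:
            insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Point lookup of one row, e.g. all neighbors of one edge type."""
        return self._rows.get(row_key, {})

    def scan_prefix(self, prefix):
        """Yield rows whose key starts with prefix, reading only that range."""
        i = bisect_left(self._keys, prefix)
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            key = self._keys[i]
            yield key, self._rows[key]
            i += 1

edge_types = SortedTable()
edge_types.put("v1_out_knows", "v2")
edge_types.put("v1_out_knows", "v3")
edge_types.put("v1_out_likes", "v3")
edge_types.put("v1_in_knows", "v4")
edge_types.put("v2_out_knows", "v1")

knows = sorted(edge_types.get("v1_out_knows"))                # one row read
all_out = [k for k, _ in edge_types.scan_prefix("v1_out_")]   # short range scan
```

Fetching the neighbors of one type is a single row read, and fetching all outgoing edge types of a vertex is a scan over a contiguous key range, so neither touches rows belonging to other vertices.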
In the example above, taking vertexid1 as the source vertex with the
property name:claudio, I'd scan vertexid1_out_knows and receive the list
of connected vertices. After that I can read the column
"Properties:gender" on these vertices and keep those with the value
"female".
Questions:
1) General: do you see a better data model for my operations?
2) Can I fit everything in one table, where for certain keys some
families would be empty (e.g. the "OutgoingEdges:" family would not make
sense for the edges)? I'd like that because, as you can see, all the
keys are prefixed with the vertexid uuid, so they would be very compact
and mostly fit on the same regionserver.
3) I guess that for the scanning I'd make extensive use of Filters, and
that the regexp filter will be my friend. Do you have concerns about the
performance of filters applied to this data model?
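To illustrate the concern behind question 3, here is a rough sketch that counts how many row keys are inspected by a regular-expression match over the whole key space versus a scan bounded by a key prefix. It is a toy model over a sorted Python list, not a benchmark of any real filter implementation, and the key format is the one assumed above.

```python
import re
from bisect import bisect_left

# 100 vertices x 2 directions x 2 edge types = 400 sorted row keys.
row_keys = sorted(
    f"v{i:03d}_{direction}_{etype}"
    for i in range(100)
    for direction in ("in", "out")
    for etype in ("knows", "likes")
)

def regex_scan(pattern):
    """Test every row key against the pattern, as an unbounded filter would."""
    rx = re.compile(pattern)
    hits, examined = [], 0
    for key in row_keys:
        examined += 1
        if rx.fullmatch(key):
            hits.append(key)
    return hits, examined

def prefix_scan(prefix):
    """Seek to the prefix and stop at the first key that no longer matches."""
    hits, examined = [], 0
    i = bisect_left(row_keys, prefix)
    while i < len(row_keys):
        examined += 1
        if not row_keys[i].startswith(prefix):
            break
        hits.append(row_keys[i])
        i += 1
    return hits, examined

regex_hits, regex_examined = regex_scan(r"v042_out_.*")
prefix_hits, prefix_examined = prefix_scan("v042_out_")
```

Both calls return the same two keys, but the regular-expression version inspects all 400 keys while the prefix version inspects 3. A filter that is not combined with start and stop keys grows with the size of the table, which is the opposite of the scaling goal stated for the Edge Types table.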
Comments (1)
This type of model looks like a sensible starting point for Cassandra (I don't know much about HBase), but with any distributed store you will run into problems when traversing, because traversals will cross multiple nodes.
This is why dedicated graph databases such as Neo4j use a single-node design and try to keep all data in RAM.
Looking up properties of particular nodes or edges should work well and scale horizontally. Twitter's FlockDB (now apparently abandoned) was a notable example of this.
You also need to consider whether you need lookups other than by ID, i.e. whether you need any indexes.
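As one possible shape for such an index, the sketch below keeps a second, hypothetical table keyed by "<property name>=<property value>" whose entries are vertex ids. It is an in-memory illustration only; the names `property_index`, `put_vertex` and `find_by_property` are not from any library, and a real store would also have to keep the two tables consistent under concurrent writes.

```python
from collections import defaultdict

vertices = {}                      # vertex id -> properties
property_index = defaultdict(set)  # "name=value" -> set of vertex ids

def put_vertex(vid, props):
    """Write a vertex and keep the property index in step with it."""
    # Drop index entries that belonged to the previous version of the row.
    for name, value in vertices.get(vid, {}).items():
        property_index[f"{name}={value}"].discard(vid)
    vertices[vid] = dict(props)
    for name, value in props.items():
        property_index[f"{name}={value}"].add(vid)

def find_by_property(name, value):
    """Look up vertex ids by property value instead of by ID."""
    return set(property_index.get(f"{name}={value}", set()))

put_vertex("v1", {"name": "claudio", "gender": "male"})
put_vertex("v2", {"name": "anna", "gender": "female"})
start = find_by_property("name", "claudio")
```

The example query in the question starts from Vertex[name=claudio], which is exactly this kind of lookup: without an index, finding the starting vertex would mean scanning every vertex row.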