Link-analysis pattern search

Published 2025-01-02 08:35:41

Problem Description

I am implementing a link-analysis algorithm over a huge graph-database.

The graph database is constructed of entities (vertexes) and relationships (edges).

Each entity type has properties. For example Person : [age,height,weight].

Each relationship has properties as well: For example Call(Phone,Phone) : [date, duration] or Own(Person, Phone) : [start-date, end-date].

Now, I am given a pattern with the following structure:

[entity-type,constraints] [relationship-type,constraints] [entity-type,constraints] [relationship-type,constraints] ... [entity-type,constraints]

For example:

[person, age>20] [own, start-date>1/1/2010] [phone, ends with '5'] [call, date>1/1/2010] [phone, starts with '6'] [owned by, start-date<1/2/2011] [person, height>40].
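For illustration, such a pattern could be represented as an alternating sequence of (kind, type-name, constraint-predicate) triples. The property names (`age`, `number`, `start-date`) and the month/day/year reading of the dates are assumptions, not part of the original problem statement:

```python
from datetime import date

# Hypothetical encoding of the example pattern above. Each predicate
# takes the element's property dict and returns True if the constraint holds.
pattern = [
    ("entity",       "person",   lambda p: p["age"] > 20),
    ("relationship", "own",      lambda r: r["start-date"] > date(2010, 1, 1)),
    ("entity",       "phone",    lambda p: p["number"].endswith("5")),
    ("relationship", "call",     lambda r: r["date"] > date(2010, 1, 1)),
    ("entity",       "phone",    lambda p: p["number"].startswith("6")),
    # "1/2/2011" is read here as 1 Feb 2011 purely for illustration.
    ("relationship", "owned by", lambda r: r["start-date"] < date(2011, 2, 1)),
    ("entity",       "person",   lambda p: p["height"] > 40),
]

# A pattern with k relationships has k+1 entities, so the sequence
# has odd length and starts and ends with an entity.
assert len(pattern) % 2 == 1
```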

I need to find ALL valid assignments for all the entities and relationships in the pattern.

I can query the database using the following primitives:

  • Find the first 1000 [entity-type,relationship-type,entity-type] assignments for a given set of constraints.
  • Find the next 1000 for the above.
  • Find the first 1000 [concrete-entity,relationship-type,entity-type] assignments for a given set of constraints.
  • Find the next 1000 for the above.

Keeping all the answers for a given query in RAM is impossible.
There may be millions (billions?) of assignments to each entity-relationship-entity triple. However, the number of assignments for the whole pattern is assumed to be small.

What I tried:

For the chain ET1-RT1-ET2-RT2-ET3-RT3...
A naive implementation would be:

Get first 1000 (ET1-RT1-ET2)
for each concrete ET2:
    Get first 1000 (ET2-RT2-ET3)
    for each concrete ET3:
        ...
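Under a toy in-memory stand-in for the paged primitives (the `triples` table, `expand` helper, and the omission of paging are all assumptions for the sketch), the naive recursion looks like this. The query counter makes the redundancy concrete: the same (concrete ET2, RT2) sub-problem is re-queried every time that ET2 is reached from a different ET1:

```python
from collections import defaultdict

# Toy stand-in for the database: rel_type -> list of (src, dst) pairs.
triples = {
    "RT1": [("a1", "b1"), ("a2", "b1"), ("a3", "b2")],
    "RT2": [("b1", "c1"), ("b2", "c2")],
}

query_count = defaultdict(int)  # how often each (src, rel_type) query is issued

def expand(src, rel_type):
    # Stand-in for "find [concrete-entity, relationship-type, entity-type]
    # assignments"; paging is omitted for brevity.
    query_count[(src, rel_type)] += 1
    return [dst for s, dst in triples[rel_type] if s == src]

def naive(chain, src):
    # Depth-first enumeration of the remaining chain from one concrete entity.
    if not chain:
        return [[src]]
    results = []
    for nxt in expand(src, chain[0]):
        for tail in naive(chain[1:], nxt):
            results.append([src] + tail)
    return results

# Enumerate the whole pattern from every candidate first entity.
matches = [m for start in ("a1", "a2", "a3")
             for m in naive(["RT1", "RT2"], start)]

# b1 is reachable from both a1 and a2, so the (b1, RT2) sub-problem is
# solved twice -- exactly the redundancy described above.
```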

The problem is that I may be solving the same sub-problems more than once.

I'm looking for an algorithm that eliminates such redundancies and is also memory-efficient.

Note:

I'm looking for an algorithm. Not for an answer such as "Use SQL JOIN" / "Use SPARQL" ...


Answer by 伴梦长久, 2025-01-09 08:35:41


Dynamic programming should be helpful here.

Let us simplify the rules to R1-R2-R3...Rk.

Let next_nodes(node x, Rule R) return all the nodes linked to x that comply with rule R. If R is an entity constraint, it returns a singleton set containing the same node if the condition is satisfied, and an empty set otherwise. For a relationship constraint, it returns all the linked nodes that satisfy the condition.

Initialize cur_set to the set of all nodes.

For each rule R in R1..Rk:
    next_set = {}
    for each node x in cur_set:
        next_set = next_set U next_nodes(x, R)
    cur_set = next_set

You should be able to store the set as a hashtable or a tree (any data structure with O(log n) search and update time).

Although I have omitted the part where you keep the path of the traversal, it should be easy to add: give each element in the set an attribute called 'path' and append the current node at every iteration. Note that multiple paths might lead to the same intermediate/final node.
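A minimal sketch of this level-by-level expansion, including the path bookkeeping (the toy `adjacency` graph, the `age` property, and the rule encoding are assumptions): each surviving node is queried once per rule regardless of how many paths reach it, which is what removes the naive approach's redundancy:

```python
# Toy graph: node -> list of linked nodes (assumption for the sketch).
adjacency = {
    "a1": ["b1"], "a2": ["b1"], "a3": ["b2"],
    "b1": ["c1"], "b2": ["c2"],
}
age = {"a1": 25, "a2": 19, "a3": 30}  # an example entity property

# Rules alternate between entity constraints and relationship steps.
rules = [
    ("entity", lambda x: age.get(x, 0) > 20),  # keep nodes passing the test
    ("rel", None),                             # follow one edge
    ("rel", None),
]

def next_nodes(x, rule):
    kind, pred = rule
    if kind == "entity":
        return {x} if pred(x) else set()
    return set(adjacency.get(x, []))

# cur maps each frontier node to the set of paths (tuples) reaching it,
# so multiple paths to the same intermediate node are kept, while the
# node itself is expanded only once per level.
cur = {x: {(x,)} for x in adjacency}
for rule in rules:
    nxt = {}
    for x, paths in cur.items():
        for y in next_nodes(x, rule):  # one query per node per rule
            new_paths = paths if y == x else {p + (y,) for p in paths}
            nxt.setdefault(y, set()).update(new_paths)
    cur = nxt
# cur now holds the final nodes and every full pattern assignment (path).
```

Here b1 is reached from both a1 and a2 in the naive scheme, but this version expands b1 a single time at its level, merging the incoming paths instead.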
