How can I efficiently build and store a semantic graph?

Posted on 2024-10-13 05:25:05

Surfing the net I ran into Aquabrowser (no need to click, I'll post a pic of the relevant part).

It has a nice way of presenting search results and discovering semantically linked entities.

Here is a screenshot taken from one of the demos.

On the left side you have the word you typed and related words.
Clicking them refines your results.

[Screenshot from an Aquabrowser demo]

Now as an example project I have a data set of film entities and subjects (like world-war-2 or prison-escape) and their relations.

Now I imagine several use cases, first where a user starts with a keyword.
For example "world war 2".

Then I would somehow like to calculate related keywords and rank them.

I am thinking of an SQL query like this:

Let's assume "world war 2" has id 3.

SELECT keywordId, COUNT(keywordId) AS total
FROM keywordRelations
WHERE movieId IN (SELECT movieId
                  FROM keywordRelations
                  JOIN movies USING (movieId)
                  WHERE keywordId = 3)
GROUP BY keywordId
ORDER BY total DESC

This should basically select all movies that have the keyword world-war-2, then look up the other keywords those movies have and pick the ones that occur most often.
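The co-occurrence idea described above can be tried out on a small in-memory SQLite database. The following is only a sketch under assumptions: the sample rows, the index name idx_keyword and the helper related_keywords are made up for illustration, and the query is written as a self-join instead of the IN subquery (it also leaves out the seed keyword itself, which the query in the question would return as the top hit).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE keywordRelations (
        movieId   INTEGER,
        keywordId INTEGER,
        PRIMARY KEY (movieId, keywordId)
    );
    -- Assumed extra index so the lookup by keywordId does not scan the table.
    CREATE INDEX idx_keyword ON keywordRelations (keywordId, movieId);
""")

# Hypothetical sample data: keyword 3 stands for world-war-2.
conn.executemany(
    "INSERT INTO keywordRelations (movieId, keywordId) VALUES (?, ?)",
    [(1, 3), (1, 7), (1, 9),
     (2, 3), (2, 7),
     (3, 7), (3, 11),
     (4, 3), (4, 11)],
)

def related_keywords(conn, keyword_id, limit=10):
    """Rank the keywords that share at least one movie with keyword_id."""
    return conn.execute(
        """
        SELECT other.keywordId, COUNT(*) AS total
        FROM keywordRelations AS seed
        JOIN keywordRelations AS other ON other.movieId = seed.movieId
        WHERE seed.keywordId = ?
          AND other.keywordId <> seed.keywordId
        GROUP BY other.keywordId
        ORDER BY total DESC, other.keywordId
        LIMIT ?
        """,
        (keyword_id, limit),
    ).fetchall()

print(related_keywords(conn, 3))
```

With an index on (keywordId, movieId), the database can find the seed's movies directly, so the cost depends mostly on how many movies carry the seed keyword rather than on the size of the whole table.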

I think with these keywords I can select the movies that match best and build a nice tag cloud containing similar movies and related keywords.

I think this should work, but it's very, very inefficient.

And it also covers only one level of relation.

There must be a better way to do this, but how?

I basically have a collection of entities. They could be of different types (movies, actors, subjects, plot keywords, etc.).

I also have relations between them.

It must somehow be possible to efficiently calculate "semantic distance" for entities.

I also would like to implement more levels of relation.
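One simple way to make "semantic distance" and "more levels of relation" concrete is to treat entities as nodes of an undirected graph and count hops with a breadth-first search. This is a sketch, not something from the original post: the entity names, build_graph, distances_from and the max_hops cap are all assumptions chosen for illustration.

```python
from collections import defaultdict, deque

def build_graph(relations):
    """Turn (entity, entity) pairs into an undirected adjacency map."""
    graph = defaultdict(set)
    for a, b in relations:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def distances_from(graph, start, max_hops=2):
    """Breadth-first search: hop distance to every entity within max_hops."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the requested level
        for neighbour in graph.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

# Hypothetical relations between movies, keywords and an actor.
relations = [
    ("keyword:world-war-2", "movie:A"),
    ("movie:A", "keyword:prison-escape"),
    ("keyword:prison-escape", "movie:B"),
    ("movie:B", "actor:X"),
]
graph = build_graph(relations)
print(distances_from(graph, "keyword:world-war-2", max_hops=2))
```

Because every entity is visited at most once, the work grows with the number of relations actually reached within max_hops, not exponentially with the number of levels.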

But I am totally stuck. I have tried different approaches, but everything ends up in algorithms that take ages to compute, with a runtime that grows exponentially.

Are there any database systems available optimized for that?

Can someone point me in the right direction?


甜扑 2024-10-20 05:25:05

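You probably want an RDF triplestore. Redland is a very commonly used one, but it really depends on your needs. Queries are written in SPARQL, not SQL. Also... you have to drink the semantic web koolaid.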
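To give a feel for the data model a triplestore works with, here is a toy in-memory version in plain Python. It is not Redland's API and not SPARQL: the triples, the predicate names hasKeyword and hasActor, and the helpers match and related_by_triples are all hypothetical, and a real store would index the triples instead of scanning a list.

```python
from collections import Counter

# Hypothetical (subject, predicate, object) triples.
triples = [
    ("movie:A", "hasKeyword", "world-war-2"),
    ("movie:A", "hasKeyword", "prison-escape"),
    ("movie:B", "hasKeyword", "world-war-2"),
    ("movie:B", "hasKeyword", "prison-escape"),
    ("movie:B", "hasActor", "actor:X"),
    ("movie:C", "hasKeyword", "prison-escape"),
    ("movie:C", "hasKeyword", "romance"),
]

def match(s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [t for t in triples
            if s in (None, t[0]) and p in (None, t[1]) and o in (None, t[2])]

def related_by_triples(seed):
    """Count the keywords of every movie that carries the seed keyword."""
    movies = {s for s, _, _ in match(p="hasKeyword", o=seed)}
    return Counter(o for s, _, o in match(p="hasKeyword")
                   if s in movies and o != seed)

print(related_by_triples("world-war-2"))
```

A graph query language such as SPARQL expresses the same thing as two triple patterns joined on the movie variable.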


烛影斜 2024-10-20 05:25:05

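From your tags I see that you are more familiar with SQL, and I think it can still be used effectively for your task.

I have an application in which a custom full-text search is implemented with SQLite as the database. In the search field I can enter terms, and a popup list shows suggestions for the current word; for every subsequent word, only words that appear in the articles containing the previously entered words are shown. So it is similar to the task you describe.

To keep things simple, let's assume we have only three tables. I suppose you have a different schema, and even the details may differ, but my explanation is only meant to give the idea.

Words [Id, Word]
This table contains the words (keywords).
Index [Id, WordId, ArticleId]
This table (also indexed by WordId) lists the articles in which each term appears.
ArticleRanges [ArticleId, IndexIdFrom, IndexIdTo]
This table lists the range of Index.Id values for any given article (obviously also indexed by ArticleId). It requires that, for any new or updated article, the Index table contains entries with a known from-to range. I suppose any RDBMS can achieve this with a little help from its auto-increment feature.

So for any given string of words, you:

Intersect the articles in which all previously entered words appear. This narrows the search. SELECT ArticleId FROM Index WHERE WordId = ... INTERSECT ...
For that list of articles, get the ranges of records from the ArticleRanges table.
For those ranges, efficiently query the list of WordId values from Index, group the results to get a count, and finally sort by it.

Although I listed these as separate steps, the final query can be just one big SQL statement built from the parsed query string.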
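The three steps above could be combined into a single statement roughly as follows. This is a sketch under assumptions, not the answerer's actual code: the Index table is called WordIndex here because INDEX is a reserved word in SQL, the sample articles are invented, and build and suggest are hypothetical helper names.

```python
import sqlite3

def build(conn, articles):
    """Create the three tables and index each article's distinct words."""
    conn.executescript("""
        CREATE TABLE Words (Id INTEGER PRIMARY KEY, Word TEXT UNIQUE);
        CREATE TABLE WordIndex (Id INTEGER PRIMARY KEY AUTOINCREMENT,
                                WordId INTEGER, ArticleId INTEGER);
        CREATE INDEX idx_wordindex_word ON WordIndex (WordId);
        CREATE TABLE ArticleRanges (ArticleId INTEGER PRIMARY KEY,
                                    IndexIdFrom INTEGER, IndexIdTo INTEGER);
    """)
    for article_id, text in articles.items():
        ids = []
        for word in sorted(set(text.split())):
            conn.execute("INSERT OR IGNORE INTO Words (Word) VALUES (?)", (word,))
            word_id = conn.execute(
                "SELECT Id FROM Words WHERE Word = ?", (word,)).fetchone()[0]
            cur = conn.execute(
                "INSERT INTO WordIndex (WordId, ArticleId) VALUES (?, ?)",
                (word_id, article_id))
            ids.append(cur.lastrowid)
        if ids:
            # Rows of one article are inserted back to back, so they form a range.
            conn.execute("INSERT INTO ArticleRanges VALUES (?, ?, ?)",
                         (article_id, ids[0], ids[-1]))

def suggest(conn, typed_words):
    """Rank the words that co-occur with all of the words typed so far."""
    if not typed_words:
        return []
    # Step 1: intersect the article lists of every typed word.
    intersect = " INTERSECT ".join(
        "SELECT ArticleId FROM WordIndex "
        "WHERE WordId = (SELECT Id FROM Words WHERE Word = ?)"
        for _ in typed_words)
    marks = ", ".join("?" for _ in typed_words)
    # Steps 2 and 3: map articles to Id ranges, then count the words in them.
    sql = f"""
        SELECT w.Word, COUNT(*) AS total
        FROM ({intersect}) AS hits
        JOIN ArticleRanges AS r ON r.ArticleId = hits.ArticleId
        JOIN WordIndex AS i ON i.Id BETWEEN r.IndexIdFrom AND r.IndexIdTo
        JOIN Words AS w ON w.Id = i.WordId
        WHERE w.Word NOT IN ({marks})
        GROUP BY w.Word
        ORDER BY total DESC, w.Word
    """
    return conn.execute(sql, list(typed_words) * 2).fetchall()

conn = sqlite3.connect(":memory:")
build(conn, {1: "war prison escape", 2: "war escape tunnel", 3: "love prison"})
print(suggest(conn, ["war"]))
print(suggest(conn, ["war", "prison"]))
```

Each typed word adds one more INTERSECT branch, which is how the query string grows with the parsed input while remaining a single statement.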