How to efficiently build and store a semantic graph?
Surfing the net I ran into Aquabrowser (no need to click, I'll post a pic of the relevant part).
It has a nice way of presenting search results and discovering semantically linked entities.
Here is a screenshot taken from one of the demos.
On the left side you have the word you typed and related words.
Clicking them refines your results.
Now, as an example project, I have a data set of film entities and subjects (like world-war-2 or prison-escape) and their relations.
Now I imagine several use cases, first where a user starts with a keyword.
For example "world war 2".
Then I would somehow like to calculate related keywords and rank them.
I am thinking of an SQL query like this:
Let's assume "world war 2" has id 3.
SELECT keywordId, COUNT(keywordId) AS total
FROM keywordRelations
WHERE movieId IN (
    SELECT movieId
    FROM keywordRelations
    JOIN movies USING (movieId)
    WHERE keywordId = 3
)
GROUP BY keywordId
ORDER BY total DESC
This should basically select all movies that have the keyword world-war-2, then look up the other keywords those films have and pick the ones that occur most often.
I think with these keywords I can select the movies that match best and build a nice tag cloud containing similar movies and related keywords.
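To make the idea concrete, here is a minimal sketch that can be run as-is with Python's built-in sqlite3 module. It is a variant of the query, not the query itself: the IN subquery is written as a self-join on keywordRelations, the join to movies is dropped because none of its columns are used, and the seed keyword is excluded from the ranking. The composite index and the sample rows are assumptions made up for illustration, and whether this form is actually faster depends on the database engine and the data.

```python
import sqlite3


def related_keywords(conn, keyword_id, limit=10):
    """Rank keywords that share at least one movie with keyword_id."""
    sql = """
        SELECT other.keywordId, COUNT(*) AS total
        FROM keywordRelations AS seed
        JOIN keywordRelations AS other ON other.movieId = seed.movieId
        WHERE seed.keywordId = ?
          AND other.keywordId <> seed.keywordId
        GROUP BY other.keywordId
        ORDER BY total DESC, other.keywordId
        LIMIT ?
    """
    return conn.execute(sql, (keyword_id, limit)).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE keywordRelations (
        movieId   INTEGER NOT NULL,
        keywordId INTEGER NOT NULL,
        PRIMARY KEY (movieId, keywordId)
    );
    -- Assumed index: lets the lookup start from the keyword side.
    CREATE INDEX idx_keyword_movie ON keywordRelations (keywordId, movieId);
""")

# Hypothetical sample data; keyword 3 stands for "world war 2".
conn.executemany(
    "INSERT INTO keywordRelations (movieId, keywordId) VALUES (?, ?)",
    [(1, 3), (1, 7), (1, 8),
     (2, 3), (2, 7),
     (3, 3), (3, 9),
     (4, 7), (4, 9)],
)

ranked = related_keywords(conn, 3)
print(ranked)  # [(7, 2), (8, 1), (9, 1)]
```

Each result row is (keywordId, number of shared movies), so the counts could feed directly into tag-cloud weights.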
I think this should work, but it's very, very inefficient.
And it also covers only one level of relation.
There must be a better way to do this, but how?
I basically have a collection of entities. They could be different kinds of entities (movies, actors, subjects, plot keywords, etc.).
I also have relations between them.
It must somehow be possible to efficiently calculate "semantic distance" for entities.
I also would like to implement more levels of relation.
But I am totally stuck. I have tried different approaches, but everything ends up in algorithms that take ages to compute, and the runtime grows exponentially.
Are there any database systems available optimized for that?
Can someone point me in the right direction?
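One possible reading of "more levels of relation" is hop count in a graph whose nodes are the entities and whose edges are the relations. The sketch below assumes that reading, which is only one of many ways to define semantic distance. It uses a breadth-first search with a depth limit, so each reachable node is visited once and the cost is bounded by the edges within that radius rather than by the number of paths. The entity names and the `type:name` labelling are hypothetical.

```python
from collections import deque


def build_graph(edges):
    """Turn (a, b) relation pairs into an undirected adjacency map."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def hop_distances(graph, start, max_depth):
    """Return {entity: hops from start} for entities within max_depth hops."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_depth:
            continue  # do not expand beyond the requested radius
        for neighbour in graph.get(node, ()):
            if neighbour not in dist:  # first visit is the shortest path
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


# Hypothetical relations between keywords, movies and an actor.
edges = [
    ("keyword:world-war-2", "movie:A"),
    ("keyword:prison-escape", "movie:A"),
    ("keyword:prison-escape", "movie:B"),
    ("actor:X", "movie:B"),
]
graph = build_graph(edges)
distances = hop_distances(graph, "keyword:world-war-2", max_depth=3)

for entity, hops in sorted(distances.items(), key=lambda item: item[1]):
    print(hops, entity)
```

In this sample, actor:X is four hops away and therefore falls outside the radius of 3. Weighted edges or a similarity measure based on shared neighbours would be alternative definitions of distance.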
Comments (2)
You probably want an RDF triplestore. Redland is a pretty commonly used one, but it really depends on your needs. Queries are done in SPARQL, not SQL. Also... you have to drink the semantic web Kool-Aid.
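For readers new to the data model, a triplestore holds facts as (subject, predicate, object) triples and answers queries by pattern matching. The toy class below is only an illustration of that model in plain Python. It is not Redland's API and it does not implement SPARQL; the class, the predicate names and the sample facts are all hypothetical.

```python
from collections import namedtuple

Triple = namedtuple("Triple", "subject predicate obj")


class TinyTripleStore:
    """In-memory set of triples with wildcard pattern matching."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add(Triple(subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Yield triples that fit the pattern; None acts as a wildcard."""
        for t in self.triples:
            if subject is not None and t.subject != subject:
                continue
            if predicate is not None and t.predicate != predicate:
                continue
            if obj is not None and t.obj != obj:
                continue
            yield t


store = TinyTripleStore()
store.add("movie:A", "hasKeyword", "world-war-2")
store.add("movie:A", "hasKeyword", "prison-escape")
store.add("movie:B", "hasKeyword", "prison-escape")
store.add("movie:A", "hasActor", "actor:X")

# Pattern (?movie, hasKeyword, world-war-2): which movies carry the keyword?
movies = sorted(
    t.subject for t in store.match(predicate="hasKeyword", obj="world-war-2")
)

# Second pattern per movie: which other keywords do those movies carry?
related = sorted(
    {t.obj for m in movies for t in store.match(subject=m, predicate="hasKeyword")}
    - {"world-war-2"}
)

print(movies)   # ['movie:A']
print(related)  # ['prison-escape']
```

A real triplestore indexes the triples so that such patterns do not require a full scan, and it lets several patterns be combined in one declarative query.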
From your tags I see you're more familiar with SQL, and I think it's still possible to use it effectively for your task.
I have an application with a custom-made full-text search implemented on top of SQLite. In the search field I can enter terms, and a popup list shows suggestions for the current word; for each subsequent word, only words that appear in articles containing the previously entered words are shown. So it's similar to the task you described.
To make things simpler, let's assume we have only three tables. I suppose you have a different schema, and even the details may differ, but my explanation is just meant to give an idea.
Words
[Id, Word]
This table contains the words (keywords).
Index
[Id, WordId, ArticleId]
This table (also indexed by WordId) lists the articles in which each term appears.
ArticleRanges
[ArticleId, IndexIdFrom, IndexIdTo]
This table (obviously also indexed by ArticleId) lists the range of Index.Id values for any given article. It requires that, for any new or updated article, the Index table contains that article's entries in a known from-to range. I suppose this can be achieved in any RDBMS with a little help from an autoincrement feature.
So for any given string of words you
Although I listed them as separate actions, the final query can be just one big SQL statement built from the parsed query string.
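As a rough illustration of how the three tables might be combined into a single statement, here is a minimal sketch using Python's built-in sqlite3 module. It is a guess at one workable arrangement, not the actual implementation described above: the helper names `add_article` and `suggest`, the sample articles, and the exact join order are assumptions. The table name Index is quoted because INDEX is an SQL keyword, and the contiguous Id range per article assumes a single writer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Words (Id INTEGER PRIMARY KEY, Word TEXT UNIQUE NOT NULL);
    CREATE TABLE "Index" (
        Id        INTEGER PRIMARY KEY AUTOINCREMENT,
        WordId    INTEGER NOT NULL REFERENCES Words(Id),
        ArticleId INTEGER NOT NULL
    );
    CREATE INDEX idx_index_word ON "Index" (WordId);
    CREATE TABLE ArticleRanges (
        ArticleId   INTEGER PRIMARY KEY,
        IndexIdFrom INTEGER NOT NULL,
        IndexIdTo   INTEGER NOT NULL
    );
""")


def add_article(conn, article_id, words):
    """Store an article's words as one contiguous run of Index.Id values."""
    if not words:
        return
    first = last = None
    for word in words:
        conn.execute("INSERT OR IGNORE INTO Words (Word) VALUES (?)", (word,))
        word_id = conn.execute(
            "SELECT Id FROM Words WHERE Word = ?", (word,)
        ).fetchone()[0]
        cur = conn.execute(
            'INSERT INTO "Index" (WordId, ArticleId) VALUES (?, ?)',
            (word_id, article_id),
        )
        last = cur.lastrowid
        if first is None:
            first = last
    conn.execute(
        "INSERT INTO ArticleRanges (ArticleId, IndexIdFrom, IndexIdTo) "
        "VALUES (?, ?, ?)",
        (article_id, first, last),
    )


def suggest(conn, previous_word, prefix):
    """Words starting with prefix that share an article with previous_word."""
    sql = """
        SELECT DISTINCT w.Word
        FROM Words AS prev
        JOIN "Index" AS hit ON hit.WordId = prev.Id
        JOIN ArticleRanges AS r ON r.ArticleId = hit.ArticleId
        JOIN "Index" AS other ON other.Id BETWEEN r.IndexIdFrom AND r.IndexIdTo
        JOIN Words AS w ON w.Id = other.WordId
        WHERE prev.Word = ?
          AND w.Word LIKE ? || '%'
          AND w.Word <> prev.Word
        ORDER BY w.Word
    """
    return [row[0] for row in conn.execute(sql, (previous_word, prefix))]


# Hypothetical articles.
add_article(conn, 1, ["war", "prison", "escape"])
add_article(conn, 2, ["war", "peace"])
add_article(conn, 3, ["prison", "planet"])

suggestions = suggest(conn, "war", "p")
print(suggestions)  # ['peace', 'prison']
```

Here "planet" is not suggested because it only occurs in an article that does not contain "war". Supporting several previously entered words would mean intersecting the article sets before the range lookup.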