Searching natural language sentence structure
What's the best way to store and search a database of natural language sentence structure trees?
Using OpenNLP's English Treebank Parser, I can get fairly reliable sentence structure parsings for arbitrary sentences. What I'd like to do is create a tool that can extract all the doc strings from my source code, generate these trees for all sentences in the doc strings, store these trees and their associated function name in a database, and then allow a user to search the database using natural language queries.
So, given the sentence "This uploads files to a remote machine." for the function upload_files(), I'd have the tree:
(TOP
  (S
    (NP (DT This))
    (VP
      (VBZ uploads)
      (NP (NNS files))
      (PP (TO to) (NP (DT a) (JJ remote) (NN machine))))
    (. .)))
If someone entered the query "How can I upload files?", equating to the tree:
(TOP
  (SBARQ
    (WHADVP (WRB How))
    (SQ (MD can) (NP (PRP I)) (VP (VB upload) (NP (NNS files))))
    (. ?)))
how would I store and query these trees in a SQL database?
I've written a simple proof-of-concept script that can perform this search using a mix of regular expressions and network graph parsing, but I'm not sure how I'd implement this in a scalable way.
And yes, I realize my example would be trivial to retrieve using a simple keyword search. The idea I'm trying to test is how I might take advantage of grammatical structure, so I can weed out entries with similar keywords but a different sentence structure. For example, with the above query, I wouldn't want to retrieve the entry associated with the sentence "Checks a remote machine to find a user that uploads files.", which has similar keywords but is obviously describing a completely different behavior.
Comments (3)
Relational databases cannot store knowledge in a natural way. What you actually need is a knowledge base or ontology (though it may be built on top of a relational database). It holds data as triplets <subject, predicate, object>, so your phrase would be stored as <upload_files(), upload, file>. There are many tools and methods for searching such KBs (Prolog, for example, is a language that was designed to do it). So all you have to do is translate sentences from natural language into KB triplets or an ontology graph, translate the user query into incomplete triplets (your question would look like <?, upload, file>) or conjunctive queries, and then search your KB. OpenNLP will help you with the translation, and the rest depends on the concrete techniques and technologies you decide to use.
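To make the incomplete-triplet idea concrete, here is a minimal Python sketch. It is only an illustration under assumptions: the triplets are written by hand rather than extracted by a parser, `find_uploader()` is a hypothetical second function standing in for the "Checks a remote machine..." docstring, and `match` is a made-up helper, not part of any KB tool.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hand-written triplets (assumption): a real system would derive these
# from parser output for each docstring sentence.
KB = [
    Triple("upload_files()", "upload", "file"),
    Triple("find_uploader()", "check", "machine"),  # hypothetical function
    Triple("find_uploader()", "find", "user"),
]


def match(kb, subject=None, predicate=None, obj=None):
    """Return triplets that agree with every field that is not None.

    None plays the role of the '?' in an incomplete triplet.
    """
    pattern = (subject, predicate, obj)
    return [t for t in kb
            if all(p is None or p == v for p, v in zip(pattern, t))]


# "How can I upload files?" becomes the incomplete triplet <?, upload, file>.
hits = match(KB, predicate="upload", obj="file")
print([t.subject for t in hits])
```

A linear scan like this is only meant to show the query shape; a dedicated triple store or Prolog would handle conjunctive queries and indexing for you.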
I agree with ffriend that you need to take a different approach that builds on existing work on knowledge bases and natural language search. Storing context-free parse trees in a relational database isn't the problem, but it is going to be very difficult to do a meaningful comparison of parse trees as part of a search. When you are just interested in taking advantage of a little knowledge about grammatical relations, parse trees are really too complicated. If you simplify the parse into dependency triples, you can make the search problem much easier and get at the grammatical relations you were interested in in the first place. For instance, you could use the Stanford dependency parser, which generates a context-free parse and then extracts dependency triples from it. For a sentence like "This function uploads files to a remote machine", it outputs a list of typed dependency triples, one per grammatical relation.
In your database, you could store a simplified subset of these triples associated with the function.
When people search, you can return the function that has the most overlapping triples, or something along those lines; you will probably also want to weight the different dependency relations, allow partial matches, and so on. You probably also want to reduce the words in the triples to lemmas, and maybe POS tags, depending on what you need.
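As a rough sketch of the overlap idea in an SQL setting, the following uses Python's built-in sqlite3 module. Everything specific here is an assumption for illustration: the table layout, the `search` helper, the function name `find_uploader`, and the triples themselves, which are hand-written in a Stanford-style (relation, head lemma, dependent lemma) form and are not actual parser output.

```python
import sqlite3

# Hand-written, lemmatized triples (assumption, not real parser output).
DOCS = {
    # "This function uploads files to a remote machine."
    "upload_files": [("root", "ROOT", "upload"), ("nsubj", "upload", "function"),
                     ("dobj", "upload", "file"), ("prep_to", "upload", "machine")],
    # "Checks a remote machine to find a user that uploads files."
    "find_uploader": [("root", "ROOT", "check"), ("dobj", "check", "machine"),
                      ("dobj", "find", "user"), ("nsubj", "upload", "user"),
                      ("dobj", "upload", "file")],
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (func TEXT, rel TEXT, head TEXT, dep TEXT)")
conn.execute("CREATE INDEX idx_triples ON triples (rel, head, dep)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?, ?)",
    [(func, *triple) for func, triples in DOCS.items() for triple in triples],
)


def search(conn, query_triples, weights=None):
    """Rank functions by the (optionally weighted) number of shared triples."""
    weights = weights or {}
    scores = {}
    for rel, head, dep in query_triples:
        rows = conn.execute(
            "SELECT DISTINCT func FROM triples WHERE rel = ? AND head = ? AND dep = ?",
            (rel, head, dep),
        )
        for (func,) in rows:
            scores[func] = scores.get(func, 0.0) + weights.get(rel, 1.0)
    # Highest score first; ties broken by function name for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# "How can I upload files?" as hand-written triples (assumption).
QUERY = [("root", "ROOT", "upload"), ("dobj", "upload", "file"),
         ("nsubj", "upload", "I"), ("advmod", "upload", "how")]

print(search(conn, QUERY))
print(search(conn, QUERY, weights={"root": 2.0}))
```

In this toy data both functions share the "upload file" object relation, and it is the root relation that separates them, which is one reason weighting relations differently can matter.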
There are plenty of people who have worked on natural language search (like Powerset), so be sure to search for existing approaches. My proposed approach here is really minimal and I can think of tons of examples where it will have problems, but I think something along these lines could work reasonably well for a restricted domain.
This is not a complete answer, but if you want to perform linguistically sophisticated queries on your trees, the best bet is to pre-process your parser output and search it with tgrep2:
http://www.stanford.edu/dept/linguistics/corpora/cas-tut-tgrep.html
Tgrep/tgrep2 are, as far as I know, the most flexible and full-featured packages for searching parse trees. This is not a MySQL-based solution as you requested, but I thought you might be interested to know about this option.
Tgrep2 allows you to ask questions about parents, descendants and siblings, whereas other solutions would not retain the full tree structure of the parse or allow such sophisticated queries.
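To give a feel for what a parent/sibling query over a full tree looks like, here is a small hand-rolled Python sketch. It does not call tgrep2; `parse`, `subtrees`, `words` and `verb_object_pairs` are hypothetical helpers written for this example. The check it performs (a VP that immediately dominates both a verb tag and an NP) is roughly what I would expect a tgrep2 pattern such as `VP < /^VB/ < NP` to express, but treat that pattern as an assumption and check the tutorial above for exact syntax.

```python
import re

DOC = """(TOP (S (NP (DT This)) (VP (VBZ uploads) (NP (NNS files))
         (PP (TO to) (NP (DT a) (JJ remote) (NN machine)))) (. .)))"""
QUERY = """(TOP (SBARQ (WHADVP (WRB How))
           (SQ (MD can) (NP (PRP I)) (VP (VB upload) (NP (NNS files)))) (. ?)))"""


def parse(text):
    """Parse a Treebank-style bracketing into nested (label, children) tuples.

    Leaves (the words) are plain strings.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()


def subtrees(tree):
    """Yield the tree and every labelled subtree below it."""
    yield tree
    for child in tree[1]:
        if isinstance(child, tuple):
            yield from subtrees(child)


def words(tree):
    """Collect the words under a subtree, left to right."""
    return [w for c in tree[1] for w in ([c] if isinstance(c, str) else words(c))]


def verb_object_pairs(tree):
    """Find (verb, object) pairs where one VP immediately dominates both."""
    pairs = []
    for label, children in subtrees(tree):
        if label != "VP":
            continue
        kids = [c for c in children if isinstance(c, tuple)]
        verbs = [c for c in kids if c[0].startswith("VB")]
        objects = [c for c in kids if c[0] == "NP"]
        for verb in verbs:
            for obj in objects:
                pairs.append((words(verb)[0].lower(), " ".join(words(obj)).lower()))
    return pairs


print(verb_object_pairs(parse(DOC)))    # [('uploads', 'files')]
print(verb_object_pairs(parse(QUERY)))  # [('upload', 'files')]
```

Note that the NP inside the PP ("a remote machine") is not reported, because the VP does not dominate it immediately. The two results still differ in "uploads" versus "upload", so some lemmatization would be needed before comparing them.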