如何构建概念搜索引擎?
我想构建一个能够将查询映射到概念的内部搜索引擎(我有数千个 XML 文件的非常大的集合)。例如,如果我搜索“大型猫科动物”,我希望排名较高的结果也返回包含“大型猫科动物”的文档。但我可能也有兴趣让它返回“巨大的动物”,尽管相关性得分要低得多。
我目前正在阅读《Python 中的自然语言处理》一书,似乎 WordNet 有一些可能有用的单词映射,尽管我不确定如何将其集成到搜索引擎中。我可以使用 Lucene 来做到这一点吗?如何?
从进一步的研究来看,“潜在语义分析”似乎与我正在寻找的内容相关,但我不确定如何实现它。
关于如何完成这项工作有什么建议吗?
I would like to build an internal search engine (I have a very large collection of thousands of XML files) that is able to map queries to concepts. For example, if I search for "big cats", I would want highly ranked results to return documents with "large cats" as well. But I may also be interested in having it return "huge animals", albeit at a much lower relevancy score.
I'm currently reading through the Natural Language Processing in Python book, and it seems WordNet has some word mappings that might prove useful, though I'm not sure how to integrate that into a search engine. Could I use Lucene to do this? How?
From further research, it seems "latent semantic analysis" is relevant to what I'm looking for but I'm not sure how to implement it.
Any advice on how to get this done?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
步骤 1. 停止。
第 2 步:让某事发挥作用。
步骤 3. 届时,您将更多地了解 Python 和 Lucene 以及其他工具以及可能集成它们的方法。
不要从尝试解决集成问题开始。软件始终可以集成。这就是操作系统的作用。它集成了软件。有时您想要“更紧密”的集成,但这永远不是要解决的第一个问题。
要解决的第一个问题是让您的搜索或概念事物或其他任何东西作为愚蠢的旧命令行应用程序工作。或者通过传递文件将一对应用程序结合在一起,或者通过操作系统管道或其他东西结合在一起。
稍后,您可以尝试找出如何实现无缝的用户体验。
但不要从集成开始,也不要因为集成问题而停滞不前。把集成放在一边,让一些东西发挥作用。
Step 1. Stop.
Step 2. Get something to work.
Step 3. By then, you'll understand more about Python and Lucene and other tools and ways you might integrate them.
Don't start by trying to solve integration problems. Software can always be integrated. That's what an Operating System does. It integrates software. Sometimes you want "tighter" integration, but that's never the first problem to solve.
The first problem to solve is to get your search or concept thing or whatever it is to work as a dumb-old command-line application. Or pair of applications knit together by passing files around or knit together with OS pipes or something.
Later, you can try and figure out how to make the user experience seamless.
But don't start with integration and don't stall because of integration questions. Set integration aside and get something to work.
这是一个极其困难的问题,并且无法以总是产生足够结果的方式解决。我建议坚持一些非常简单的原则,这样结果至少是可以预测的。我认为你需要两件事:一些基本的形态引擎加上同义词词典。
每当搜索查询到达时,对于每个单词,您
然后对所有单词重复输入单词的组合,即“big cats”、“big cat”、“huge cats”、“huge cat”等。
事实上,您也需要以规范形式(单数、第一种形式等)存储索引数据 至于概念
,比如猫也是动物——这就是它从来没有真正起作用的地方,因为否则谷歌就会返回概念匹配,但它没有这样做。
This is an incredibly hard problem and it can't be solved in a way that would always produce adequate results. I'd suggest to stick to some very simple principles instead so that the results are at least predictable. I think you need 2 things: some basic morphology engine plus a dictionary of synonyms.
Whenever a search query arrives, for each word you
Then repeat for all combinations of the input words, i.e. "big cats", "big cat", "huge cats" huge cat" etc.
In fact, you need to store your index data in canonical form, too (singluar, first form etc) along with the literal form.
As for concepts, such as cats are also animals - this is where it gets tricky. It never really worked, because otherwise Google would have been returning conceptual matches already, but it's not doing that.
首先,我同意这里的大部分建议,即缓慢开始,首先构建这个宏伟计划的各个部分,设计一个最小的第一个产品,然后从那里继续。
其次,如果您想在 Lucene 中使用一些 Wordnet 功能,可以使用 contrib 包,用于将 Lucene 查询与 Wordnet 连接。我不知道它是否被移植到 pylucene 上。祝你好运,出门要小心。
First, I agree with most of the advice here about starting slow, and first building bits and pieces of this grand plan, devising a minimal first product and continuing from there.
Second, if you want to use some Wordnet functionality with Lucene, there is a contrib package for interfacing Lucene queries with Wordnet. I have no idea whether it was ported over to pylucene. Good luck and be careful out there.
首先,编写一段Python代码,当你输入apple时,它会返回pineapple、orange、papaya。通过关注语义网络的“是”关系。然后继续有关系等等。
我认为最后,您可能会为学校项目获得相当足够的代码。
First , write a piece of python code that will return you pineapple , orange , papaya when you input apple. By focusing on "is" relation of semantic network. Then continue with has a relationship and so on.
I think at the end , you might get a fairly sufficient piece of code for a school project.