Keyword search engine that returns statistics rather than hits
First post on StackOverflow, but I've always looked to this site as a great source of shared knowledge, and I'm excited to see what comes up from this question.
As I feel I have now reached the limits of what I can do with SQL indexes, statistics and full-text search, I'm currently looking for a search library that can provide us with the functionality we need. I'm not averse to writing it myself (and open-sourcing it if I can get the boss's approval), but I would prefer to find something open-source that already exists, natch.
What we're after is a search engine that can provide statistics about the results that are matched when a user searches for a specific keyword. Let's say, for example, that we were talking about a database of products in an online shop. We need to be able to return statistics about how many products match a given set of keywords (and also be able to filter this result set by price, category, etc.), as well as the total number of products in stock (assuming that this is stored in a field in the product table). All the search engines that I have found return the top n results, and if you want statistics about the size of the result set, you need to enumerate the whole set. Even if you didn't, you would still need to enumerate it to retrieve the total number of products in stock.
Is there anything anyone knows of that is capable of this functionality? As I say, I'm happy to get my hands dirty and either build it myself, or modify the functionality of something like Lucene, but I have not been able to find anything appropriate on Google.
Thanks in advance guys!
Comments (3)
You might take a look at Solr, which is a faceted search engine built on top of Lucene. Solr will count lots of different things for you, in addition to doing full-text search. It is good at handling combinations of structured and full-text data.
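As a rough sketch of what that could look like for the shop example in the question: Solr's standard select handler accepts `rows=0` (return no documents), `facet.field` (per-value counts), `stats.field` (sum/min/max over a numeric field) and `fq` (filter queries). The host, the core name `products`, and the field names `category`, `price` and `stock` below are assumptions made up for illustration, not anything from the question.

```python
from urllib.parse import urlencode

# Hypothetical Solr location and core name; adjust to your own setup.
SOLR_SELECT = "http://localhost:8983/solr/products/select"


def build_stats_query(keywords, price_min=None, price_max=None):
    """Build a Solr URL that asks only for counts and stats, not documents."""
    params = [
        ("q", keywords),
        ("rows", 0),                  # no documents, only the summary sections
        ("facet", "true"),
        ("facet.field", "category"),  # assumed field: matches per category
        ("stats", "true"),
        ("stats.field", "stock"),     # assumed field: units in stock
        ("wt", "json"),
    ]
    if price_min is not None and price_max is not None:
        # Filter query narrowing the result set to a price range.
        params.append(("fq", f"price:[{price_min} TO {price_max}]"))
    return SOLR_SELECT + "?" + urlencode(params)


def summarize(response):
    """Pull the match count, stock total and facet counts from a parsed JSON response."""
    # By default Solr's JSON writer returns facets as a flat list:
    # [value1, count1, value2, count2, ...]
    flat = response["facet_counts"]["facet_fields"]["category"]
    return {
        "matches": response["response"]["numFound"],
        "total_stock": response["stats"]["stats_fields"]["stock"]["sum"],
        "per_category": dict(zip(flat[::2], flat[1::2])),
    }
```

You would fetch the URL with whatever HTTP client you prefer and pass the decoded JSON to `summarize`. The number of matches comes from `numFound`, so nothing has to be enumerated on the client side.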
Something to keep in mind here is that "enumerating all results" can mean very different things: a select count(*) is very different from doing all the joins etc. required to actually get each object. This is true in Lucene as well as relational databases. So I wouldn't worry about the mere fact that the documentation says "we enumerate all results." It's been my experience that the standard faceting of Solr scales to what 99% of people need. If you are in that 1% (i.e. you have a huge database), then I can suggest some ways of estimating the results that can be quicker. But Solr will probably work for you.
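To make the distinction concrete, here is a small sketch using Python's built-in sqlite3 module. The `products` table, its columns and the sample rows are invented for illustration, and a plain `LIKE` stands in for a real full-text match. The point is only that the count and the stock total come back as a single aggregate row, without materializing any product objects.

```python
import sqlite3

# In-memory database with a made-up products table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products ("
    "id INTEGER PRIMARY KEY, name TEXT, price REAL, stock INTEGER)"
)
conn.executemany(
    "INSERT INTO products (name, price, stock) VALUES (?, ?, ?)",
    [
        ("red shoes", 40.0, 5),
        ("blue shoes", 55.0, 3),
        ("red hat", 15.0, 10),
        ("running shoes", 80.0, 0),
    ],
)


def match_stats(conn, keyword, max_price):
    """Return match count and total stock in one aggregate query."""
    row = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(stock), 0) "
        "FROM products WHERE name LIKE ? AND price <= ?",
        (f"%{keyword}%", max_price),  # LIKE is a stand-in for full-text search
    ).fetchone()
    return {"matches": row[0], "total_stock": row[1]}


print(match_stats(conn, "shoes", 60))
```

The database still has to visit every matching row internally, but that is far cheaper than joining and returning each one.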
Are you sure you have reached those limits? I ask because if you are using MySQL, you might want to look into the full text search functionality of PostgreSQL, especially when you combine it with the btree_gin and trigram (pg_trgm) modules and the extremely decent EXPLAIN functionality (postgresql.org/docs/9.0/static/row-estimation-examples.html), which allows you to extract reasonable row estimates from highly complex queries.
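A minimal sketch of reading such an estimate programmatically, assuming PostgreSQL 9.0 or later, where `EXPLAIN (FORMAT JSON)` returns the plan as JSON and the top-level node carries a "Plan Rows" value. The helper names are made up for illustration, and `cursor` is assumed to be any DB-API cursor connected to PostgreSQL (for example one from psycopg2).

```python
import json


def estimated_rows(explain_output):
    """Extract the planner's row estimate from EXPLAIN (FORMAT JSON) output.

    Accepts either the raw JSON text or an already-decoded list, since
    some drivers decode JSON columns automatically.
    """
    if isinstance(explain_output, str):
        plans = json.loads(explain_output)
    else:
        plans = explain_output
    # The output is a one-element list; "Plan Rows" on the top node is
    # the estimated number of rows the query would return.
    return plans[0]["Plan"]["Plan Rows"]


def estimate_matches(cursor, select_sql, params=()):
    """Ask the planner how many rows a SELECT would return, without running it.

    Pass the plain SELECT, not a SELECT count(*): an aggregate's top node
    always estimates a single row.
    """
    cursor.execute("EXPLAIN (FORMAT JSON) " + select_sql, params)
    return estimated_rows(cursor.fetchone()[0])
```

The number is an estimate based on table statistics, so it is only as good as the last ANALYZE; it suits "about 1,200 results" displays, not exact totals.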