高效的过滤/搜索
我们有一个管理内容页面的托管应用程序。 每个页面可以有许多自定义字段和一些标准字段(时间戳、用户名、用户电子邮件等)。
由于可能有数百个不同的站点使用该系统——处理过滤/搜索的有效方法是什么? 想象一个您想要缩小范围的网格视图。 您可以过滤特定字段(用户 ID、日期),也可以输入全文搜索。
例如,“由用户 ID 10 启动的所有页面”对于 MySQL 数据库来说是一个非常快速的查询。 但是像“由用户 ID 为 10 且匹配 [某些搜索查询] 的用户启动的所有页面”之类的内容会对数据库产生不良影响,因此它适合像 Lucene 这样的搜索引擎。
基本上我想知道其他大型网站是如何做这种事情的。 他们是否 100% 使用搜索引擎进行所有类型的过滤? 他们是否将数据库查询与搜索引擎混合在一起?
如果我们仅使用搜索引擎,则新的/更新的对象出现在搜索索引中所需的延迟时间就会出现问题。 也就是说,我读到立即更新索引并分批更新并不明智。 即使这意味着每 5 分钟一次,当用户查看简单的页面列表(例如搜索查询“category:5”)时,如果最近添加的页面没有立即列出,他们也会感到困惑。
我们正在使用 MySQL,并且一直在密切关注 Lucene 的搜索。 还有其他我不知道的技术吗?
我的想法是提供一个简单的过滤页面,它使用 MySQL 来过滤基本字段。 然后提供一个单独的全文搜索页面,该页面将显示类似于 Google 的结果。 这是唯一的方法吗?
We have a hosted application that manages pages of content. Each page can have a number of customized fields, and some standard fields (timestamp, user name, user email, etc).
With potentially hundreds of different sites using the system -- what is an efficient way to handle filtering/searching? Picture a grid view that you want to narrow down. You can filter on specific fields (userid, date) or you can enter a full-text search.
For example, "all pages started by userid 10" would be a pretty quick query against a MySQL database. But things like "all pages started by a user whose userid is 10 and matches [some search query]" would suck against the database, so it's suited for a search engine like Lucene.
Basically I'm wondering how other large sites do this sort of thing. Do they utilize a search engine 100% for all types of filtering? Do they mix database queries with a search engine?
If we use only a search engine, there's a problem with the delay time it takes for a new/updated object to appear in the search index. That is, I've read that it's not smart to update the index immediately, and to do it in batches instead. Even if this means every 5 minutes, users will get confused when their recently added page isn't immediately listed when they view a simple page listing (say a search query of "category:5").
We are using MySQL and have been looking closely at Lucene for searching. Is there some other technology I don't know about?
My thought is to offer a simple filtering page which uses MySQL to filter on basic fields. Then offer a separate fulltext search page that would present results similar to Google. Is this the only way?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
Solr 或grassyknoll 都提供了稍微更抽象的Lucene 接口。
那说:是的。 如果您是一个主要内容驱动的网站,提供对数据的全文搜索,那么除了 LIKE 之外还有其他功能可以发挥作用。 虽然 MySql 的 FULLTEXT 索引并不完美,但它可能是过渡期间可接受的占位符。
假设您创建了一个 Lucene 索引,将 Lucene Documents 链接到您的关系对象非常简单,只需在索引时向文档添加一个存储的属性(该属性可以是 url、ID、GUID 等)。然后,搜索就变成了 2相位系统:
1)向 Lucene 索引发出查询(显示标题等简单结果)
2) 通过键从关系存储中获取有关该对象的更多详细信息
由于 Documents 的实例化在 Lucene 中相对昂贵,因此您只想存储在 Lucene 索引中搜索的字段,而不是关系对象的完整克隆。
Solr or grassyknoll both provide slightly more abstract interfaces to Lucene.
That said: Yes. If you are a primarily content driven site, providing fulltext searching over your data, there is something in play beyond LIKE. While MySql's FULLTEXT indexies aren't perfect, it might be an acceptable placeholder in the interim.
Assuming you do create a Lucene index, linking Lucene Documents to your relational objects is pretty straightforward, simply add a stored property to the document at index time (this property can be a url, ID, GUID etc.) Then, searching becomes a 2 phase system:
1) Issue query to Lucene indexies (Display simple results like title)
2) Get more detailed information about the object from your relational stores by its key
Since instantiation of Documents is relatively expensive in Lucene, you only want to store fields searched in the Lucene index, as opposed to complete clones of your relational objects.
不要这么轻易地注销 MySQL!
使用数据库来实现它,例如在 where 子句或其他子句中使用“like”进行选择。
对它进行分析,必要时添加索引。 推出测试版,这样您就可以从用户的实际数据模式中获得真实的数字 - 并非所有列都可能被同样询问,等等。
如果性能确实很差,那么您就需要考虑其他选项。 你可以考虑调整你的SQL、你的数据库、运行数据库的机器,最后使用另一个技术堆栈......
Don't write-off MySQL so readily!
Implement it using the database e.g. a select with a 'like' in the where-clause or whatever.
Profile it, add indexes if necessary. Roll out a beta, so you get real numbers from user's actual data patterns - not all columns might be equally asked after, etc.
If the performance does suck, then thats when you consider other options. You can consider tuning your SQL, your database, the machine the database is running on, and finally using another technology stack...
如果您想使用 MySQL 或 PostgreSQL,Sphinx 是一个非常适合使用的开源解决方案:
http://www.sphinxsearch.com/
我们遇到了同样的问题,并考虑使用 Sphinx 和 Lucene解决方案。
In case you want to use MySQL or PostgreSQL, a open source solution that works great with it is Sphinx:
http://www.sphinxsearch.com/
We are having the same problem and considering Sphinx and Lucene as possible solutions.