Solr for a constantly updating index
I have a news site with 150,000 news articles. About 250 new articles are added to the database each day, at intervals of 5-15 minutes. I understand that Solr is optimized for millions of records, so my 150K won't be a problem for it. But I am worried that the frequent updates will be a problem, since the cache gets invalidated with every update. On my dev server, a cold page load takes 5-7 seconds (since every page runs a few MLT queries).
Would it help to split my index into two: an archive index and a latest index? The archive index would be updated once a day.
Can anyone suggest ways to optimize my installation for a constantly updating index?
Thanks
Comments (2)
My answer is: test it! Don't try to optimize yet if you don't know how it performs. Like you said, 150K is not a lot, so it should be quick to build an index of that size for your tests. After that, run a couple of MLT queries from several concurrent threads (to simulate users) while you index more documents, and see how it behaves.
One setting that you should keep an eye on is auto-commit. Since you are indexing constantly, you can't commit after each document (you will bring Solr down). The value you choose for this setting lets you tune the latency of the system (how long it takes for new documents to show up in results) while keeping the system responsive.
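A rough sketch of such a test in Python, using only the standard library. The core name `news`, the `/mlt` handler path (a MoreLikeThis handler has to be configured in solrconfig.xml for this URL to exist), and the field names `title` and `body` are all assumptions; substitute your own.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: a core named "news" with a MoreLikeThis handler at /mlt.
SOLR_MLT_URL = "http://localhost:8983/solr/news/mlt"


def build_mlt_url(doc_id, base_url=SOLR_MLT_URL):
    """Build an MLT request for one article (field names are assumptions)."""
    params = {"q": f"id:{doc_id}", "mlt.fl": "title,body", "rows": 5, "wt": "json"}
    return f"{base_url}?{urlencode(params)}"


def fetch(url):
    """Issue the HTTP request and discard the body; only timing matters here."""
    with urlopen(url, timeout=30) as resp:
        resp.read()


def timed_call(fetch_fn, url):
    """Return the wall-clock seconds taken by one request."""
    start = time.perf_counter()
    fetch_fn(url)
    return time.perf_counter() - start


def run_load_test(doc_ids, workers=8, fetch_fn=fetch):
    """Fire one MLT query per id from a pool of threads; return the latencies."""
    urls = [build_mlt_url(d) for d in doc_ids]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: timed_call(fetch_fn, u), urls))
```

Run `run_load_test(...)` with a list of real document ids while a separate process keeps indexing, then compare the latencies against a run with indexing paused.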
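Auto-commit itself is configured in solrconfig.xml. A related per-request knob, which is not the same setting but serves a similar purpose, is the `commitWithin` update parameter: the indexer sends documents without an explicit commit and asks Solr to make them visible within a given number of milliseconds. Below is a minimal sketch of building such a request; the core name `news`, the 60-second window, and the document fields are assumptions.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint: a core named "news".
SOLR_UPDATE_URL = "http://localhost:8983/solr/news/update"


def build_update_request(docs, commit_within_ms=60000, base_url=SOLR_UPDATE_URL):
    """Build a JSON update request that asks Solr to commit within the given
    window instead of committing after every document."""
    url = f"{base_url}?{urlencode({'commitWithin': commit_within_ms})}"
    body = json.dumps(docs).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` sends the batch. A larger window means fewer commits (and fewer cache invalidations) at the cost of new articles taking longer to appear.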
Consider using mlt=true in the main query instead of issuing a separate MoreLikeThis query for each result. You'll save the round trips, so it will be faster.
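As an illustration, here is a small sketch that builds such a query URL. The `mlt`, `mlt.fl`, and `mlt.count` parameters belong to Solr's MoreLikeThis search component; the core name `news`, the `/select` path, and the field names are assumptions.

```python
from urllib.parse import urlencode

# Assumed endpoint: a core named "news" with the standard /select handler.
SOLR_SELECT_URL = "http://localhost:8983/solr/news/select"


def build_search_with_mlt(query, mlt_fields=("title", "body"), mlt_count=3,
                          rows=10, base_url=SOLR_SELECT_URL):
    """Build one search request that also asks for similar documents
    for every hit, instead of one extra MLT request per hit."""
    params = {
        "q": query,
        "rows": rows,
        "mlt": "true",                    # enable the MoreLikeThis component
        "mlt.fl": ",".join(mlt_fields),   # fields used to compute similarity
        "mlt.count": mlt_count,           # similar documents per hit
        "wt": "json",
    }
    return f"{base_url}?{urlencode(params)}"
```

The similar documents should then come back in a separate `moreLikeThis` section of the same response, so a page needs a single request.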