Lucene index: shared or isolated by account?
I'm evaluating Lucene to implement a global search feature in a SaaS application.
We do not want users to see the content of other accounts, so searches will always be limited by account.
Is it better to have a single index with an account ID field, or one index per account? What are the advantages and disadvantages of each approach?
My concern is that a global index might affect performance due to the frequent updates.
Thank you.
EDIT
- Estimated number of total documents: 500,000
- Number of accounts: 4000
- Indexable data is never shared between accounts
- Account users might update their indexable data several times a day (not more than 100 in most cases)
- The amount of indexed data tends to be stable after the initial setup process
- We need to store 10-20 fields per document
Comments (3)
here are some things I would think about in addition to the usual problems (e.g. index updates and such):
If it were me, if there is no regulatory reason why you cannot, I'd dump them all into a single index. This is simply my "don't optimize what you don't have to" hat speaking.
The first concern is simply legal: are you even ALLOWED to co-host and intermix data, even if it is separated by logical means? That's up to your lawyers, customers, and service agreements. This is not a technical concern.
Assuming you can, the next question is what impact users will have on each other. If User A is using the system and User B is in the process of importing their 100K documents, is that going to impact User A? And is it impacting User A because of how Lucene works, or simply because of the overall system load that occurs when importing and indexing documents?
Try it and see.
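One way to try it and see, as a minimal sketch: time User A's searches with nothing else running, then while User B imports into the same index, then while User B imports into a separate index on the same machine. Everything here (ToyIndex, bulk_import, timed_searches, compare) is a hypothetical stand-in written for this illustration. It is not Lucene, so the numbers it prints mean nothing; only the shape of the experiment carries over to a real test.

```python
import statistics
import threading
import time


class ToyIndex:
    """Hypothetical in-memory stand-in for a real index (not Lucene)."""

    def __init__(self):
        self._docs = []
        self._lock = threading.Lock()

    def add(self, account_id, text):
        with self._lock:
            self._docs.append((account_id, text))

    def search(self, account_id, word):
        with self._lock:
            return [t for a, t in self._docs if a == account_id and word in t]


def bulk_import(index, account_id, count):
    for i in range(count):
        index.add(account_id, f"doc {i}")


def timed_searches(index, account_id, word, runs):
    """Median and worst latency, in seconds, over `runs` searches."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        index.search(account_id, word)
        samples.append(time.perf_counter() - start)
    return {"median": statistics.median(samples), "max": max(samples)}


def compare(runs=50, import_size=2000):
    """Time User A's searches in three scenarios."""
    results = {}
    for scenario in ("idle", "shared_import", "separate_import"):
        shared = ToyIndex()
        bulk_import(shared, "A", 200)  # User A's existing documents
        importer = None
        if scenario != "idle":
            # User B imports either into A's index or into one of its own.
            destination = shared if scenario == "shared_import" else ToyIndex()
            importer = threading.Thread(
                target=bulk_import, args=(destination, "B", import_size)
            )
            importer.start()
        results[scenario] = timed_searches(shared, "A", "doc", runs)
        if importer is not None:
            importer.join()
    return results


if __name__ == "__main__":
    for scenario, stats in compare().items():
        print(scenario, stats)
```

Against a real setup, the comparison is read like this: if both import scenarios slow User A by about the same amount, the cause is overall system load; if only the shared-index import hurts, the shared index itself is the likely cause.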
The key thing is to make sure that your client systems do not access Lucene directly, but rather through a facade of some kind. This facade is a perfect place to enforce the client segregation, and it's also a good place to redirect traffic if, at some later time, you decide you need to shard your indexes.
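A minimal sketch of such a facade, assuming a hypothetical in-memory ToyIndex in place of Lucene (SearchFacade, ToyIndex and _index_for are names made up for this illustration, not Lucene APIs). Client code passes an account ID and a search word; the facade applies the account restriction itself, and _index_for is the single place where traffic could later be redirected to a different index.

```python
class ToyIndex:
    """Hypothetical in-memory stand-in for a real index (not Lucene)."""

    def __init__(self):
        self._docs = []

    def add(self, account_id, text):
        self._docs.append({"account_id": account_id, "text": text})

    def find(self, required_account_id, word):
        return [
            d["text"]
            for d in self._docs
            if d["account_id"] == required_account_id and word in d["text"]
        ]


class SearchFacade:
    """The only search entry point that client code is given."""

    def __init__(self):
        self._shared = ToyIndex()  # today: one index for every account

    def _index_for(self, account_id):
        # The seam: every account resolves to the shared index for now.
        # Sharding, or moving one account out, would change only this method.
        return self._shared

    def index_document(self, account_id, text):
        self._index_for(account_id).add(account_id, text)

    def search(self, account_id, word):
        # Callers cannot omit or override the account restriction.
        return self._index_for(account_id).find(account_id, word)


facade = SearchFacade()
facade.index_document("acme", "quarterly invoice")
facade.index_document("globex", "invoice draft")
print(facade.search("acme", "invoice"))    # ['quarterly invoice']
print(facade.search("globex", "invoice"))  # ['invoice draft']
print(facade.search("acme", "draft"))      # []
```

In Lucene itself, that restriction is typically expressed as a non-scoring filter clause: a TermQuery on the account ID field, added to a BooleanQuery with BooleanClause.Occur.FILTER.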
Perhaps you need to tear out a single heavy user. Or you sell a better response-time tier to someone who is guaranteed more resources in their SLA, etc.
But deciding, right now, what the better path is? Eh, seems early.
500K documents is not a lot of data to Lucene. Just make sure your implementation has the flexibility to add capability later if you find out that hosting it all in a single instance isn't viable. And by "add capability" I mean exactly that: add it. Don't actually IMPLEMENT, say, sharding based on client. Rather, have a good point where it COULD be implemented later without redoing a bunch of plumbing.
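A back-of-envelope check of that sizing claim, as a minimal sketch. It uses the 500K document count quoted above and the question's 4000 accounts. It also assumes that "not more than 100" in the question means at most 100 updates per account per day, which the question does not spell out.

```python
TOTAL_DOCS = 500_000         # "500K documents", as read by this answer
ACCOUNTS = 4_000             # from the question
UPDATES_PER_ACCOUNT = 100    # assumed daily ceiling per account
SECONDS_PER_DAY = 24 * 60 * 60

docs_per_account = TOTAL_DOCS / ACCOUNTS
updates_per_day = ACCOUNTS * UPDATES_PER_ACCOUNT
updates_per_second = updates_per_day / SECONDS_PER_DAY

print(docs_per_account)              # 125.0
print(updates_per_day)               # 400000
print(round(updates_per_second, 1))  # 4.6
```

Under those assumptions, a single shared index would see an average of roughly 5 updates per second even if every account hit the ceiling every day, and the per-account alternative would mean about 4000 indexes averaging 125 documents each. Peaks such as an initial import will be burstier than this daily average.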
I've done a few "security trimmed" indexes here and there -- definitely possible if it is allowed. That said, my general inclination with SaaS-type stuff with multiple clients would be to separate the clients as much as possible, for a few reasons:
a) Ensures coding errors don't result in data leaks, angry clients, lawsuits and other hoo ha.
b) Makes per-client customization much easier -- your entire codebase need not deal with client-specific fubar requests.
c) Forces you into a horizontally scalable architecture from day one -- scaling is easy if adding instances is easy, right?
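Points a) to c) can be sketched as follows, assuming hypothetical in-memory ToyIndex objects in place of Lucene indexes (PerAccountSearch, ToyIndex and the normalize option are names invented for this illustration, not Lucene APIs). Each account gets its own index, so a search has no account filter to forget, per-client options stay inside that client's index, and adding an account just means adding another index.

```python
class ToyIndex:
    """Hypothetical in-memory stand-in for one account's index (not Lucene)."""

    def __init__(self, normalize=str.lower):
        self._normalize = normalize  # per-account customization hook
        self._docs = []

    def add(self, text):
        self._docs.append(text)

    def find(self, word):
        needle = self._normalize(word)
        return [t for t in self._docs if needle in self._normalize(t)]


class PerAccountSearch:
    """One index per account; a query only ever touches one of them."""

    def __init__(self):
        self._indexes = {}

    def add_account(self, account_id, **index_options):
        self._indexes[account_id] = ToyIndex(**index_options)

    def index_document(self, account_id, text):
        self._indexes[account_id].add(text)

    def search(self, account_id, word):
        # Unknown accounts raise KeyError instead of searching shared data.
        return self._indexes[account_id].find(word)


search = PerAccountSearch()
search.add_account("acme")
search.add_account("globex", normalize=lambda text: text)  # case-sensitive client
search.index_document("acme", "Quarterly Invoice")
search.index_document("globex", "Invoice Draft")
print(search.search("acme", "invoice"))    # ['Quarterly Invoice']
print(search.search("globex", "invoice"))  # []
print(search.search("globex", "Invoice"))  # ['Invoice Draft']
```

The trade-off, at the question's 4000 accounts, is that there are 4000 indexes to create and maintain instead of one.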
Oh, and definitely take Will Hartung's advice -- put search behind a facade; that stuff really should not creep out of its layer.