Lucene indexing: shared or isolated by account?

Posted on 2024-11-03 10:23:04

I'm evaluating Lucene to implement a global search feature in a SaaS application.

We do not want users to see the content of the other accounts so searches will always be limited by account.

Is it better to have one single index with an account id field or one index per account? What are the advantages and disadvantages of each approach?

My concern is that a global index might affect performance due to the frequent updates.

Thank you.

EDIT

  • Estimated number of total documents: 500,000
  • Number of accounts: 4000
  • Indexable data is never shared between accounts
  • Account users might update their indexable data several times a day (not more than 100 in most cases)
  • The amount of indexed data tends to be stable after the initial setup process
  • We need to store 10-20 fields per document

Comments (3)

反目相谮 2024-11-10 10:23:04

Here are some things I would think about in addition to the usual problems (e.g. index updates and such):

  1. The way Lucene ranks results depends on some "corpus-wide" statistics, for example the total number of documents in which a term appears for a given field. So if the index statistics for customer A are inappropriate for customer B, it's going to hurt relevance for both customers, besides being a security risk: if Oscar is smart enough, he can start reverse-engineering Bob's documents because of the nature of the inverted index: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.159.9682 You could probably work around this with something like this ranking algorithm: https://issues.apache.org/jira/browse/LUCENE-2864
  2. Some other things in Lucene apply to a "field as a whole" or to the "index as a whole", and you should know that they can't really be changed on a per-customer basis if you group indexes together: things like omitTF (if you set it on a single document for a field, it's omitted across the board for that field), similarity (in any released version of Lucene, you can only set similarity across the board, so customers wouldn't be able to tune the ranking model), spellchecking (you would have to hack something up where each customer has their own "filtered" spellcheck index), ...
  3. On the other hand, if you have many terms, quite a bit of RAM is required, and by giving each customer their own index you will need more memory to hold the terms index in RAM for all the indexes. You can, however, lower this somewhat by adjusting things like termIndexInterval/Divisor.
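
To make point 1 concrete, here is a minimal sketch of how a corpus-wide statistic shifts when two accounts share an index. It assumes the classic TF-IDF style weighting `1 + ln(numDocs / (docFreq + 1))`; the two tiny corpora and all function names are hypothetical, and this is plain Python, not the Lucene API.

```python
import math

def classic_idf(doc_freq: int, num_docs: int) -> float:
    """Classic TF-IDF style inverse document frequency (assumed formula)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def doc_freq(term: str, docs: list) -> int:
    """Number of documents that contain the term."""
    return sum(1 for d in docs if term in d)

def idf_in(term: str, docs: list) -> float:
    """IDF of a term computed over the given corpus only."""
    return classic_idf(doc_freq(term, docs), len(docs))

# Hypothetical corpora: each document is modelled as a set of terms.
account_a = [{"invoice", "overdue"}, {"invoice", "paid"}, {"invoice", "draft"}]
account_b = [{"recipe", "pasta"}, {"recipe", "invoice"}, {"recipe", "soup"}]
shared = account_a + account_b

idf_a_only = idf_in("invoice", account_a)  # "invoice" is common for A
idf_b_only = idf_in("invoice", account_b)  # "invoice" is rare for B
idf_shared = idf_in("invoice", shared)     # one blended weight for both

print(f"A only: {idf_a_only:.3f}  B only: {idf_b_only:.3f}  shared: {idf_shared:.3f}")
```

In the shared corpus both accounts get the same blended weight for "invoice", which is higher than account A would see alone and lower than account B would see alone. The same blending is what lets one tenant's scores reveal something about another tenant's term usage.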
2024-11-10 10:23:04

If it were me, and there were no regulatory reason why you cannot, I'd dump them all into a single index. This is simply my "don't optimize what you don't have to" hat speaking.

The first concern is simply legal: are you even ALLOWED to co-host and intermix data, even if it is separated by logical means? That's up to your lawyers, customers, and service agreements. This is not a technical concern.

Assuming you can, the next question is what impact users will have on each other. If User A is using the system while User B is importing their 100K documents, is that going to impact User A? And if so, is it because of how Lucene works, or simply because of the overall system load that occurs when importing and indexing documents?

Try it and see.

The key thing is to make sure that your client systems do not access Lucene directly, but rather through a facade of some kind. This facade is a perfect place to enforce the client segregation, and it's also a good place to redirect traffic if, at some later time, you decide you need to shard your indexes.

Perhaps you need to tear out a single heavy user. Or you sell a higher level of response time to someone that is guaranteed more resources in their SLA, etc.
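
A minimal sketch of such a facade follows, under stated assumptions: `InMemoryIndex` is a hypothetical stand-in for a real index (it is not the Lucene API), and the `dedicated` mapping is an illustrative way to move one heavy account onto its own index later. The point is only that callers never pass a raw query to the index and never choose the index themselves.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Doc:
    account_id: str
    text: str

@dataclass
class InMemoryIndex:
    """Hypothetical stand-in for a real search index."""
    docs: list = field(default_factory=list)

    def add(self, doc: Doc) -> None:
        self.docs.append(doc)

    def search(self, term: str, account_id: str) -> list:
        # The account filter is part of every query, never optional.
        return [d for d in self.docs
                if d.account_id == account_id and term in d.text.lower().split()]

class SearchFacade:
    """Single entry point: enforces account isolation and picks the index."""

    def __init__(self, default_index, dedicated=None):
        self._default = default_index
        self._dedicated = dict(dedicated or {})  # account_id -> own index

    def _index_for(self, account_id: str):
        # Routing hook: the one place that changes if sharding is added later.
        return self._dedicated.get(account_id, self._default)

    def index(self, account_id: str, text: str) -> None:
        self._index_for(account_id).add(Doc(account_id, text))

    def search(self, account_id: str, term: str) -> list:
        hits = self._index_for(account_id).search(term.lower(), account_id)
        return [d.text for d in hits]

# Usage: accounts "a" and "b" share an index, "heavy" has been moved out.
facade = SearchFacade(InMemoryIndex(), {"heavy": InMemoryIndex()})
facade.index("a", "Quarterly invoice")
facade.index("b", "Invoice draft")
facade.index("heavy", "invoice archive")
print(facade.search("a", "invoice"))
```

Because the account ID is a required argument and the filter lives inside the facade, a caller cannot forget it, and moving an account to a dedicated index only touches `_index_for`.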

But deciding, right now, what the better path is? Eh, seems early.

500K documents is not a lot of data to Lucene. Just make sure you have flexibility in your implementation to add capability later if you find out that hosting it all in a single instance isn't viable. And by "add capability" I mean exactly that, add it. Don't actually IMPLEMENT, say, sharding based on client. But rather have a good point where it COULD be implemented without redoing a bunch of plumbing later.

怪我太投入 2024-11-10 10:23:04

I've done a few "security trimmed" indexes here and there -- definitely possible if it is allowed. That said, my general inclination with SaaS-type stuff with multiple clients would be to separate the clients as much as possible, for a few reasons:

a) Ensures coding errors don't result in data leaks, angry clients, lawsuits and other hoo ha.
b) Makes per-client customization much easier -- your entire codebase need not deal with client-specific fubar requests
c) Forces you into a horizontally scalable architecture from day one -- scaling is easy if adding instances is easy, right?

Oh, and definitely take Will Hartung's advice -- facade the search; that stuff really should not creep out of its layer.