Lucene index update strategy

Posted on 2025-01-01 00:08:28

I'm building a Lucene web server (in Java) for my application and expect almost 100 searches per second from the upstream application. The server will be hosted on several physical boxes behind a load balancer.

From a data perspective, I will have almost 50K documents (each smaller than 1 KB), with about 500 new or updated documents per day.

I would like to know the recommended way to index those ~500 documents each day without degrading performance for the upstream search process.

I cannot use any location shared between all my servers to share a file-based index. A couple of options I can think of:

1) Use DB indexes (JDBC Directory) - not sure of the pros and cons.
2) Use RAMDirectory indexes - not sure about the update strategy.
3) Use file indexes - I cannot think of a robust design for building file-based indexes and distributing them across the physical boxes.

I would like to hear thoughts and recommendations on the right index setup strategy.

Comments (1)

活泼老夫 2025-01-08 00:08:28

Do you really need to build the query/indexing server yourself?

Have you considered Elasticsearch? It partitions and replicates your index automatically; you just configure how many partitions (shards, in Elasticsearch terms) you want and how many replicas of each. It also gives you a simple HTTP interface for indexing and querying. In Elasticsearch all nodes/instances are equal, so you can send documents and queries to any of the nodes.

With an index as small as 50K documents, I would guess that a single partition with a few replicas could handle your 100 queries/second requirement.

Anyway, your requirements seem light. 50K documents of less than 1 KB each (at most about 50 MB of raw text) seem like a good fit for an in-memory index (RAMDirectory in Lucene). Depending on the queries issued against the index, you could handle the 100 queries/second with fewer machines.

Because you have no hard requirement on update latency and the number of new documents is very small, new documents can be indexed in many ways. You could send the documents via HTTP to each instance, or send a CSV file (or some other format) with the updated documents via ssh/ftp and have each instance index that file once a day.
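
To make the "each instance indexes the file once a day" idea concrete, here is a minimal sketch of one way an instance could apply the daily batch without blocking searches: build a fresh in-memory snapshot off to the side and publish it with a single atomic reference swap. This is not Lucene code. `SnapshotIndex`, `applyDailyBatch`, and the toy term-to-ids map are hypothetical stand-ins for illustration, and the sketch assumes the whole document set fits comfortably in memory, as it should at this size. Lucene has its own facilities for refreshing searchers (for example `SearcherManager`), which would probably be the better fit in a real server.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical stand-in for a per-instance in-memory index (not Lucene). */
class SnapshotIndex {

    /** Immutable view that searches read from; never modified after construction. */
    private static final class Snapshot {
        final Map<String, String> docs;           // document id -> text
        final Map<String, Set<String>> postings;  // term -> sorted document ids

        Snapshot(Map<String, String> docs) {
            this.docs = docs;
            Map<String, Set<String>> p = new HashMap<>();
            for (Map.Entry<String, String> e : docs.entrySet()) {
                // Toy tokenizer: lowercase, split on non-word characters.
                for (String term : e.getValue().toLowerCase(Locale.ROOT).split("\\W+")) {
                    if (!term.isEmpty()) {
                        p.computeIfAbsent(term, k -> new TreeSet<>()).add(e.getKey());
                    }
                }
            }
            this.postings = p;
        }
    }

    private final AtomicReference<Snapshot> current =
            new AtomicReference<>(new Snapshot(new HashMap<>()));

    /** Search path: reads whichever snapshot is currently published; takes no locks. */
    List<String> search(String term) {
        Set<String> ids = current.get().postings.get(term.toLowerCase(Locale.ROOT));
        return ids == null ? Collections.emptyList() : new ArrayList<>(ids);
    }

    /**
     * Daily batch: copy the current documents, overlay the new/updated ones,
     * rebuild the postings, then publish the result in one atomic step.
     * Synchronized only so that two batches cannot overlap each other.
     */
    synchronized void applyDailyBatch(Map<String, String> newOrUpdated) {
        Map<String, String> docs = new HashMap<>(current.get().docs);
        docs.putAll(newOrUpdated);
        current.set(new Snapshot(docs));
    }
}
```

Searches that are in flight during the swap keep using the old snapshot, and later searches see the new one, so readers never observe a half-applied batch. Rebuilding everything each day is wasteful in general, but with roughly 50K small documents it keeps the update logic trivial.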
