Using Lucene to query an RDBMS database

Posted on 2024-10-12 03:45:06

I've skimmed the docs for the Java version of Lucene, but I can't really see the top-level "this is how it works" info so far (I'm aware I need to RTFM, I just can't see the wood for the trees).

I understand Lucene uses search indexes to return results. As far as I know, it only returns "hits" from those indexes. If I haven't added an item of data when building the index then it won't be returned.

That's fine, so now I want to check the following assumption:

Q: Does that mean that any data I want displayed on a search page needs to be added to the Lucene index?

I.e.
If I want to search for Products by things like sku, description, category name, etc, but I also want to display the Customer they belong to in search results, do I:

  1. Make sure the Lucene index has the denormalised Customer's name in the index.
  2. Use the hits returned by Lucene to somehow query the database for the actual product records and use a JOIN to get the Customer's name.

I assume it's option 1, since I'm assuming there's no way to "join" the results of a Lucene query to an RDBMS, but I wanted to ask whether my assumptions about the general usage are correct.

Comments (3)

生生漫 2024-10-19 03:45:06

Usually the index would only contain the fields you want to search on, not necessarily the ones you want to display. Indexes should be optimized to be as small as possible, to keep search performance good.

To be able to display more data, add a field to your index that lets you retrieve the full document/data, i.e. a unique key for your Product (product id?).
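
To illustrate the idea of an index that holds only searchable text plus a key, here is a minimal sketch. It is a toy inverted index written with the Java standard library, not Lucene's API; the class name `SearchOnlyIndex` and its methods are made up for this example.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy stand-in for a search index: it stores terms and product ids only.
// Display data (customer name, price, ...) stays in the database.
final class SearchOnlyIndex {
    // term -> ids of the products whose searchable text contains that term
    private final Map<String, Set<Long>> postings = new HashMap<>();

    // Index the searchable fields of one product under its unique key.
    void add(long productId, String... searchableFields) {
        for (String field : searchableFields) {
            for (String term : field.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!term.isEmpty()) {
                    postings.computeIfAbsent(term, t -> new LinkedHashSet<>()).add(productId);
                }
            }
        }
    }

    // A search returns keys only; the caller fetches the full rows elsewhere.
    Set<Long> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Set.of());
    }
}
```

Calling `add(1L, "SKU-100", "Red garden hose")` and then `search("hose")` yields the product id, which is all that is needed to load the full record from the database.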

记忆之渊 2024-10-19 03:45:06

I have been trying to figure out the same problem, but I think that it's too much work. I'm thinking of this as an alternative. Please correct me if I'm wrong in my thinking!

Your situation is like this:
RDBMS product (many) <------> (many) Customer

Instead of putting only the customer in the Lucene index to get product keys and then querying the RDBMS with an IN query, I'd suggest creating the Lucene index from the cartesian product of Product and Customer, i.e. one index entry per customer/product pair.

Like
customer_1, product_1
customer_1, product_2
customer_2, product_2..

This way, when you search for a product in Lucene, it gives you both the customer id and the product id. Instead of joining them in the RDBMS, you can simply look up those customers and products there for more information, if there is a need. If you are using caching, the cost of the additional detail lookups will also go down.
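
A rough sketch of how the pair entries could be produced before indexing. It uses plain Java collections rather than Lucene, and it assumes the many-to-many relationship has already been loaded as a map from customer id to product ids; `PairIndexEntries` and `flatten` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class PairIndexEntries {
    // Turn "customer -> products" into one (customerId, productId) entry per
    // related pair, which is the shape each index document would take.
    static List<Map.Entry<Long, Long>> flatten(Map<Long, List<Long>> productIdsByCustomer) {
        List<Map.Entry<Long, Long>> entries = new ArrayList<>();
        for (Map.Entry<Long, List<Long>> customer : productIdsByCustomer.entrySet()) {
            for (Long productId : customer.getValue()) {
                entries.add(Map.entry(customer.getKey(), productId));
            }
        }
        return entries;
    }
}
```

Each resulting entry would become one index document carrying both ids alongside the searchable product text. Note that the index grows with the number of pairs, not the number of products.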

绮筵 2024-10-19 03:45:06

Based on BrokenGlass's answer, I've thought some more and am proposing the following to see if I'm on the right lines:

Basically, taking option 2 further, one could do the following:

  1. Put only the data you want to search on into the Lucene index, plus some sort of key value (e.g. the PK of a table in your database).
  2. Query Lucene to get a list of hits.
  3. Using your data access layer of choice, build a query for your database that includes an IN (value [, value]) predicate.
  4. Get the results for that query from your database (which may well include JOINs to other tables).
  5. Put those results in a dictionary, using the PK of the resultset as the key.
  6. Iterate the Lucene hits again in order, pulling the items from the dictionary using the PK so you can build a list of results in the order that Lucene returned the hits (i.e. sorted by relevance).
  7. Display that "sorted" list of results to the user.

Of course steps 5 and 6 could be better, but for the sake of explanation I put that verbose method in my description. If the Lucene hits include some sort of "relevance" value, then you could attribute that to the resultset and perform a standard sort, but that's an exercise for the reader. :)

Could this be it?
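
Steps 3, 5 and 6 above could look roughly like the sketch below. It uses only the Java standard library: the Lucene hits are represented as a plain list of primary keys, the database rows as a map, and the table and column names in `main` (`product`, `customer`, `customer_id`) are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class HitOrderedLookup {

    // Step 3: build "col IN (?, ?, ...)" with one placeholder per hit, so the
    // keys can be bound through a PreparedStatement instead of concatenated.
    static String inClause(String column, int keyCount) {
        if (keyCount <= 0) {
            throw new IllegalArgumentException("need at least one key");
        }
        return column + " IN (" + String.join(", ", Collections.nCopies(keyCount, "?")) + ")";
    }

    // Steps 5 and 6: rows are keyed by PK; walking the hits in their original
    // order restores the relevance ordering that the database does not keep.
    static <R> List<R> inHitOrder(List<Long> hitIds, Map<Long, R> rowsById) {
        List<R> ordered = new ArrayList<>();
        for (Long id : hitIds) {
            R row = rowsById.get(id);
            if (row != null) { // skip hits whose row is gone (stale index)
                ordered.add(row);
            }
        }
        return ordered;
    }

    public static void main(String[] args) {
        List<Long> hitIds = List.of(42L, 7L, 19L); // PKs in relevance order
        String sql = "SELECT p.id, p.sku, c.name FROM product p "
                + "JOIN customer c ON c.id = p.customer_id WHERE "
                + inClause("p.id", hitIds.size());
        System.out.println(sql);

        // Stand-in for the query results, which arrive in arbitrary order.
        Map<Long, String> rowsById = new HashMap<>();
        rowsById.put(7L, "SKU-7 / Acme");
        rowsById.put(19L, "SKU-19 / Globex");
        rowsById.put(42L, "SKU-42 / Acme");
        System.out.println(inHitOrder(hitIds, rowsById));
    }
}
```

Skipping missing rows is a design choice for the case where the index lags behind the database. Many databases also limit how many values an IN list may hold, so a large hit list may need to be paged or queried in batches.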
