How do I index different sources in Solr?

Posted on 2024-11-06 21:50:33

How do I index text files, web sites and a database in the same Solr schema? All 3 sources are required and I'm trying to figure out how to do it. I built some examples and they work fine while they are separate from each other, but now I need them all in 1 schema, since the user will be searching across all 3 data sources.

How should I proceed?

Comments (1)

卸妝后依然美 2024-11-13 21:50:33

You should sketch out a few notes for each of your content sources:

  • What metadata is available
  • How the information is accessed
  • How you want to present the information

Once that is done, determine which metadata you want to make searchable. Some of it might be specific to just one of the content sources (such as author on web pages, or any given field in a DB row), while other fields will be present in all sources (such as unique ID, title, text content). Use copy fields to consolidate fields as needed.
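To illustrate the split between shared and source-specific fields, here is a minimal Python sketch that maps records from each source onto one document shape before they are sent to Solr. The field names (`id`, `source`, `title`, `text`), the `source_` prefix convention and the helper `to_solr_doc` are assumptions made up for this example, not anything Solr requires.

```python
from typing import Any, Dict

# Fields every document carries, whatever its origin (hypothetical names).
COMMON_FIELDS = ("id", "source", "title", "text")


def to_solr_doc(source: str, raw_id: str, title: str, text: str,
                **extra: Any) -> Dict[str, Any]:
    """Build one document for a shared schema (illustrative helper)."""
    doc: Dict[str, Any] = {
        # Prefix the ID with the source so IDs from different systems cannot clash.
        "id": f"{source}:{raw_id}",
        "source": source,
        "title": title,
        "text": text,
    }
    # Source-specific metadata gets a prefixed field name, e.g. web_author.
    for name, value in extra.items():
        if value is not None:
            doc[f"{source}_{name}"] = value
    return doc


file_doc = to_solr_doc("file", "/docs/2011/report.txt", "report.txt",
                       "Quarterly numbers ...", path="/docs/2011/report.txt")
web_doc = to_solr_doc("web", "http://example.com/about", "About us",
                      "We build ...", author="Jane Doe")
db_doc = to_solr_doc("db", "products/42", "Widget",
                     "A small widget.", price=9.99)
```

With a shape like this, the schema only needs to declare the common fields once, plus whichever prefixed fields you decide are worth searching or displaying.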

Metadata will vary greatly from project to project, but yes: things like update date, filename, and any structured data you can parse out of the text files will surely help you improve relevance. Beyond that, it varies a lot from case to case. Maybe the file paths hint at a (possibly informal) taxonomy you can use as metadata. Maybe the filenames themselves contain metadata (such as year, keywords, product names, etc.).
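As a rough sketch of pulling metadata out of paths and filenames, the snippet below treats directory names as an informal taxonomy and looks for a four-digit year in the filename. The example path, the returned keys and the function name `metadata_from_path` are all hypothetical; real naming conventions will need their own rules.

```python
import re
from pathlib import PurePosixPath

# A standalone four-digit year starting with 19 or 20.
YEAR_RE = re.compile(r"(?<!\d)(19|20)\d{2}(?!\d)")


def metadata_from_path(path: str) -> dict:
    """Derive candidate metadata from a file path (illustrative helper)."""
    p = PurePosixPath(path)
    match = YEAR_RE.search(p.stem)
    # Split the filename (without extension) on punctuation and underscores,
    # keeping the non-numeric tokens as keywords.
    keywords = [t.lower() for t in re.split(r"[\W_]+", p.stem)
                if t and not t.isdigit()]
    return {
        "filename": p.name,
        # Directory names, minus the root, as an informal taxonomy.
        "categories": [part for part in p.parent.parts if part != "/"],
        "year": int(match.group(0)) if match else None,
        "keywords": keywords,
    }


meta = metadata_from_path("/manuals/printers/LaserJet_setup_2011.txt")
```

Each key could then be indexed as its own field, so that it can be boosted or faceted on separately from the body text.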

Be prepared to use different fields for different sources when displaying results. A source field goes a long way when building result tiles, and it might turn out to be your most used facet.

An alternative (and probably preferable) approach to relying heavily on copy fields is to use the DisMax/EDisMax request handlers, which make it easy to search across several fields.

Consider using a mix of copy fields and (e)dismax. For instance, copy all fields into a catch-all text field that need not be stored, and include it in searches with a low boost value, alongside highly weighted fields (such as title, headings, keywords, or filename). There are a lot of parameters to tweak in dismax, but it's definitely worth the effort.
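A minimal sketch of what such a mixed query could look like, built as request parameters in Python. `defType`, `qf`, `facet` and `facet.field` are standard Solr query parameters; the field names (`title`, `keywords`, `filename`, the catch-all `text_all`, and `source`) and the boost values are assumptions for illustration and would have to match your own schema.

```python
from urllib.parse import urlencode


def build_edismax_params(user_query: str, rows: int = 10) -> dict:
    """Assemble edismax parameters with per-field boosts (illustrative helper)."""
    # Hypothetical fields: specific ones weighted high, the catch-all low.
    field_boosts = {
        "title": 5.0,
        "keywords": 3.0,
        "filename": 2.0,
        "text_all": 0.5,  # catch-all copy-field target, indexed but not stored
    }
    return {
        "q": user_query,
        "defType": "edismax",
        "qf": " ".join(f"{name}^{boost}" for name, boost in field_boosts.items()),
        # Facet on the source field so users can narrow results by origin.
        "facet": "true",
        "facet.field": "source",
        "rows": rows,
    }


params = build_edismax_params("laser printer setup")
query_string = urlencode(params)
```

The resulting `query_string` would be appended to the select URL of your core. The boosts here are only starting points to tune against real queries.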
