Search technology advice

Posted 2024-09-29 21:48:02

This is more of a theory question than a practical one. I'm working on a project which is quite a simple catalog of links. The whole model is similar to the Dmoz or Yahoo catalog, except that each entry has certain additional attributes.

I have a hierarchical taxonomy working on all entries with a many-to-many relationship. All entries are now sorted into these categories, and everything seems to work fine. Now, what use is a catalog if there's no search option?

Here's a little more detail about my models: each entry has a title, description, URL and several social profiles: YouTube, Twitter, Flickr and a couple of others. Each entry can have a logo attached to it, and a hidden field for tags. Also, the title and description are stored in three different languages. So basically I'd like the search results to be:

  1. Relevant (including taxonomy)
  2. Possibly ones with logos
  3. Possibly ones with 100% filled out profiles

I've tried Sphinx and am currently working with Lucene, but it seems that I'm not getting the theory of search right. I hope it makes sense that filled-out entries should appear higher than the others, but I can't really figure out the scores. I wouldn't like irrelevant entries to appear on top just because a single word matches somewhere in the description, since titles are more relevant.

So my question is: are there any books, techniques or even other search engines (if Sphinx and Lucene are not good enough) that you would recommend for this? Not only would I like full control over the search results and their ranking, I would also like to give my visitors correct and relevant information.

Links to cool articles are appreciated too!

And no, I'm not trying to rebuild Google :)

Thanks :)

Comments (4)

西瓜 2024-10-06 21:48:02

Excellent book: Lucene in Action (2nd edition)

When we started with Lucene we had the first edition, and it really takes you through everything you need step by step. Highly recommended. The 2nd edition is updated for the latest and greatest version (3.x.x).

The Tf-Idf algorithm works very well on (larger) texts, but if you have a record-like structure it may backfire: because scores are normalized by field length, documents with only a few terms are considered more "relevant" than the ones with many terms. With Lucene you will get it to work, but you'll have to get your hands dirty.

What you'll basically have to do is boost your title field, so that a match there counts for more. You may also change the scoring mechanism to assign higher scores to documents that have more information.
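To make the two ideas above concrete, here is a minimal sketch in plain Python. It is not Lucene's actual scoring formula: the boost values, the list of optional fields and the `completeness_weight` parameter are all assumptions chosen for illustration. It shows a length-normalized term score per field, a higher boost for the title, and a multiplier that rewards more complete entries.

```python
import math

# Hypothetical field boosts: a title match counts more than a description match.
FIELD_BOOSTS = {"title": 10.0, "tags": 5.0, "description": 1.0}

# Hypothetical optional fields that make an entry "more complete".
OPTIONAL_FIELDS = ("logo", "youtube", "twitter", "flickr")


def field_score(query_terms, text):
    """Count term matches and normalize by field length (short fields score higher)."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    matches = sum(tokens.count(term.lower()) for term in query_terms)
    return matches / math.sqrt(len(tokens))


def completeness(entry):
    """Fraction of the optional fields that are filled in, between 0.0 and 1.0."""
    filled = sum(1 for name in OPTIONAL_FIELDS if entry.get(name))
    return filled / len(OPTIONAL_FIELDS)


def score(entry, query_terms, completeness_weight=0.5):
    """Boosted text score, multiplied by a bonus for complete entries."""
    text_score = sum(
        boost * field_score(query_terms, entry.get(field, ""))
        for field, boost in FIELD_BOOSTS.items()
    )
    # Multiplying (rather than adding) keeps entries with no text match at zero.
    return text_score * (1.0 + completeness_weight * completeness(entry))


title_hit = {"title": "Python tutorials", "description": "Learn to code"}
weak_hit = {
    "title": "Cooking blog",
    "description": "Recipes and a python shaped cake among many other things",
    "logo": "logo.png", "youtube": "y", "twitter": "t", "flickr": "f",
}
ranked = sorted([weak_hit, title_hit], key=lambda e: score(e, ["python"]), reverse=True)
```

With these assumed weights the title match ranks first even though the other entry has a complete profile, which is the behaviour asked for in the question. Completeness then only breaks ties between entries that match about equally well.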

Have fun. If you can't figure it out, there is excellent support on the Lucene mailing list.

人心善变 2024-10-06 21:48:02

I'm pretty sure that Lucene is enough. We have solved a similar task and it worked well. Here are some hints I can offer, looking back at my project built on Lucene.Net.

Taxonomy:

  • A category is represented as an integer key in the DB, so each document has multiple instances of the field 'CATEGORY' of type Number. For example document:[1,2,5,10, 'Wheel'] means that Wheel belongs to each of the categories 1, 2, 5 and 10.
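As a rough illustration of that multi-valued category field, here is a small Python sketch (not Lucene.Net code). The document layout and the `build_category_index` helper are hypothetical; the point is only that one document can appear under several category keys, the way a multi-valued indexed field behaves.

```python
from collections import defaultdict


def build_category_index(documents):
    """Map each category ID to the set of document IDs that carry it."""
    index = defaultdict(set)
    for doc_id, doc in documents.items():
        # One document contributes one posting per category it belongs to.
        for category_id in doc["categories"]:
            index[category_id].add(doc_id)
    return index


documents = {
    "wheel": {"title": "Wheel", "categories": [1, 2, 5, 10]},
    "tyre": {"title": "Tyre", "categories": [2, 7]},
}
category_index = build_category_index(documents)
in_category_2 = category_index[2]  # both documents belong to category 2
```

Restricting a search to a category then amounts to intersecting the text hits with the set stored under that category key.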

Non-searchable fields (logos, social profile):

  • Of course you can store non-searchable values in Lucene's non-indexed fields. But we keep all product-related information in the DB, so that changes to it do not force us to rebuild the Lucene index. Lucene therefore holds only the product ID and the indexed (but not stored) values of the key fields.

Three languages and multiple fields:

  • We have only 2 languages, so the different titles of a product can be stored in the same Lucene document and relate to a single product ID (as I wrote before, the ID refers to the DB). This lets you find a product even if the user's request mixes languages.
  • Obviously title, tags and description carry different weight in the search result. Lucene handles this by assigning a weight (boost) to each field.
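A minimal Python sketch of the "all languages in one document" idea, under stated assumptions: the `titles` mapping, the language codes and the `search_ids` helper are made up for illustration and are not part of Lucene.Net. The index only returns IDs; everything else would be loaded from the DB afterwards.

```python
def search_ids(index_docs, query):
    """Return the IDs of documents whose titles, in any language, contain every query term."""
    terms = query.lower().split()
    hits = []
    for doc in index_docs:
        # Titles in all languages are searched together, so a mixed-language query can match.
        searchable = " ".join(doc["titles"].values()).lower().split()
        if all(term in searchable for term in terms):
            hits.append(doc["id"])
    return hits


index_docs = [
    {"id": 42, "titles": {"en": "Red wheel", "de": "Rotes Rad"}},
    {"id": 43, "titles": {"en": "Blue tyre", "de": "Blauer Reifen"}},
]
mixed_language_hits = search_ids(index_docs, "red rad")
```

Here the query mixes an English and a German word and still finds product 42, because both titles live in the same document.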

随波逐流 2024-10-06 21:48:02

I will try to add to the fine answers by Matthijs, Dewfy and Karussell.
Basically, you are trying to improve your search relevance.
I suggest you read Grant Ingersoll's Debugging Search Application Relevance Issues and his Optimizing Findability in Lucene and Solr, as well as his Practical Relevance slides.

For different languages and for faceting I suggest you use Solr. It is a search engine built on Lucene that is easy to use. It can support multiple languages by using a separate Solr core per language.
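A small Python sketch of routing a query to a per-language core. The base URL, the core names and the language codes are assumptions for illustration; only the general `/solr/<core>/select` request shape comes from Solr itself.

```python
from urllib.parse import urlencode

SOLR_BASE_URL = "http://localhost:8983/solr"  # assumed local Solr instance

# Hypothetical core names, one per language of the catalog.
CORES_BY_LANGUAGE = {"en": "entries_en", "de": "entries_de", "fr": "entries_fr"}


def select_url(language, query, default_language="en"):
    """Build the select URL for the core that indexes the given language."""
    core = CORES_BY_LANGUAGE.get(language, CORES_BY_LANGUAGE[default_language])
    return f"{SOLR_BASE_URL}/{core}/select?{urlencode({'q': query})}"


german_url = select_url("de", "rad")
```

Each core can then have its own schema with language-specific analyzers (stemming, stop words), which is the main reason to split by language.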

千紇 2024-10-06 21:48:02

Lucene or Solr would do the job. Solr is built on top of Lucene, see here for more info

I would go with Solr. Downloading it and setting it up is easy and fast. Get started with the tutorial and my link collection. Relevancy should be fine with Solr and is easy to tune.

Look at Dewfy's and Matthijs Bierman's answers for some good points.

Then choose the dismax query handler, which lets you prefer docs with certain properties.

E.g. for the percentage of a full profile you define a separate field 'profile_completeness'; then you can add profile_completeness to the bf (boost function) parameter of the dismax handler: the more complete the profile is, the more those docs will be boosted.

I mentioned before that you can easily tune the relevancy: e.g. you can weight the searched fields with qf and add the function boost with bf, something like: qf=title^10 tags^5 and bf=profile_completeness^1

"Possibly ones with logos" can be solved via boost queries: bq=logo:[* TO *]^1, where logo:[* TO *] matches every doc that has a value in the logo field, so those docs get the extra boost without the others being excluded.
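Put together, the request parameters could look like the following Python sketch. In dismax, per-field weights go in qf, function boosts in bf and boost queries in bq. The field names (title, tags, description, profile_completeness, logo) are taken from the question or assumed, and the weights are starting points to tune, not recommended values.

```python
from urllib.parse import urlencode


def build_dismax_params(user_query):
    """Collect the dismax parameters discussed above into one request."""
    return {
        "defType": "dismax",
        "q": user_query,
        # Query fields with per-field weights: title matches count most.
        "qf": "title^10 tags^5 description^1",
        # Boost function: a numeric field whose value is added to the score.
        "bf": "profile_completeness^1",
        # Boost query: docs with any value in the logo field score a bit higher.
        "bq": "logo:[* TO *]^1",
    }


query_string = urlencode(build_dismax_params("photo sharing"))
```

The resulting query string can be appended to the select URL of the Solr core. Because bf and bq only add to the score, entries without a logo or with an empty profile are still returned, just further down.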

To display a deeply nested category tree you will need to build that tree in memory and feed Solr the data through a special import. We have a working app for that. You can use our approach
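The approach linked above is not reproduced here. As a generic sketch of the "build the tree in memory" step, assuming each category row has a name and an optional parent ID and that the data contains no cycles, full paths can be computed like this before being sent to Solr with each entry:

```python
def build_paths(categories):
    """categories maps id -> (name, parent_id or None); returns id -> 'Root/Child/...' path."""
    paths = {}

    def path_of(category_id):
        # Memoize so every category path is computed only once.
        if category_id not in paths:
            name, parent_id = categories[category_id]
            if parent_id is None:
                paths[category_id] = name
            else:
                paths[category_id] = path_of(parent_id) + "/" + name
        return paths[category_id]

    for category_id in categories:
        path_of(category_id)
    return paths


categories = {
    1: ("Computers", None),
    2: ("Programming", 1),
    3: ("Python", 2),
}
category_paths = build_paths(categories)
```

Indexing the full path (or every prefix of it) per entry is one common way to let the search side facet on any level of the tree.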

If you need further assistance don't hesitate to comment.
