ElasticSearch, Sphinx, Lucene, Solr, Xapian. Which fits which usage?

Posted on 2024-08-21 05:57:11

I'm currently looking at other search methods rather than having a huge SQL query.
I saw elasticsearch recently and played with whoosh (a Python implementation of a search engine).

Can you give reasons for your choice(s)?

Comments (7)

话少心凉 2024-08-28 05:57:12

Try IndexTank.

As with Elasticsearch, it was designed to be much easier to use than Lucene/Solr. It also includes a very flexible scoring system that can be tweaked without reindexing.
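To make "scoring that can be tweaked without reindexing" concrete, here is a minimal Python sketch. It is not IndexTank's or Elasticsearch's actual API: the `Doc` class, the `rank` function and both formulas are hypothetical. The assumption is that the index already holds a text relevance value plus a few numeric variables per document, and that the final score is a formula applied at query time, so swapping the formula does not touch the index.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Doc:
    doc_id: str
    relevance: float  # text-match score from the engine (assumed to be given)
    votes: int        # numeric variable stored alongside the document


# A scoring function is just a formula over values already in the index.
ScoreFn = Callable[[Doc], float]


def rank(docs: List[Doc], score: ScoreFn) -> List[str]:
    """Return document ids ordered by a query-time scoring formula."""
    return [d.doc_id for d in sorted(docs, key=score, reverse=True)]


docs = [
    Doc("a", relevance=2.0, votes=1),
    Doc("b", relevance=1.0, votes=50),
    Doc("c", relevance=1.5, votes=5),
]

# Formula 1: pure text relevance.
by_relevance = rank(docs, lambda d: d.relevance)

# Formula 2: boost popular documents. Same stored data, different formula.
by_popularity = rank(docs, lambda d: d.relevance * (1 + d.votes))
```

The second ranking puts the heavily voted document first even though nothing in `docs` was rewritten.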

两个我 2024-08-28 05:57:12

Lucene is nice and all, but their stop word set is awful. I had to manually add a ton of stop words to StopAnalyzer.ENGLISH_STOP_WORDS_SET just to get it anywhere near usable.
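For readers who have not dealt with stop words before, the following Python sketch shows the general idea of extending a default stop-word set. It does not use Lucene; the word lists and the `tokenize` helper are hypothetical stand-ins chosen only for illustration.

```python
import re
from typing import FrozenSet, List

# A small default list, standing in for a library default (contents are made up).
DEFAULT_STOP_WORDS: FrozenSet[str] = frozenset(
    {"a", "an", "and", "the", "of", "to", "in", "is"}
)

# Extra words added by hand for a particular corpus.
EXTRA_STOP_WORDS: FrozenSet[str] = frozenset({"i", "you", "very", "just", "about"})


def tokenize(text: str, stop_words: FrozenSet[str]) -> List[str]:
    """Lowercase, split on non-alphanumerics, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in stop_words]


text = "I just read about the history of search engines"
default_tokens = tokenize(text, DEFAULT_STOP_WORDS)
extended_tokens = tokenize(text, DEFAULT_STOP_WORDS | EXTRA_STOP_WORDS)
```

With only the default list, filler words such as "just" and "about" still end up in the index; the extended set leaves the terms that actually carry meaning.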

I haven't used Sphinx but I know people swear by its speed and near-magical "ease of setup to awesomeness" ratio.

等待圉鍢 2024-08-28 05:57:11

As the creator of ElasticSearch, maybe I can give you some reasoning on why I went ahead and created it in the first place :).

Using pure Lucene is challenging. There are many things you need to take care of if you want it to really perform well. Also, it's a library, so there is no distributed support; it's just an embedded Java library that you need to maintain.

In terms of Lucene usability, way back when (almost 6 years ago now), I created Compass. Its aim was to simplify using Lucene and make everyday Lucene simpler. What I came across time and time again was the requirement to be able to have Compass distributed. I started to work on it from within Compass, by integrating with data grid solutions like GigaSpaces, Coherence, and Terracotta, but it was not enough.

At its core, a distributed Lucene solution needs to be sharded. Also, with the advancement of HTTP and JSON as ubiquitous APIs, a solution should be easy to use from many different systems written in different languages.
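As a rough picture of what "sharded" means here, the Python sketch below splits a toy inverted index across several in-process shards. It is not how ElasticSearch is implemented; the `ShardedIndex` class, the CRC32-based routing and the document ids are assumptions made for illustration. A document is routed to one shard by hashing its id, and a query is sent to every shard and the hits are merged.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Set


class ShardedIndex:
    """Toy in-process stand-in for an index split across several shards."""

    def __init__(self, num_shards: int) -> None:
        self.num_shards = num_shards
        # One tiny inverted index (term -> doc ids) per shard.
        self.shards: List[Dict[str, Set[str]]] = [
            defaultdict(set) for _ in range(num_shards)
        ]

    def shard_for(self, doc_id: str) -> int:
        # Stable hash, so the same id always routes to the same shard.
        return zlib.crc32(doc_id.encode("utf-8")) % self.num_shards

    def index(self, doc_id: str, text: str) -> None:
        shard = self.shards[self.shard_for(doc_id)]
        for term in text.lower().split():
            shard[term].add(doc_id)

    def search(self, term: str) -> List[str]:
        # Scatter the query to every shard, then gather and merge the hits.
        hits: Set[str] = set()
        for shard in self.shards:
            hits |= shard.get(term.lower(), set())
        return sorted(hits)


idx = ShardedIndex(num_shards=3)
idx.index("doc-1", "distributed search with Lucene")
idx.index("doc-2", "search over HTTP and JSON")
idx.index("doc-3", "embedded Java library")
```

In a real system each shard would live on a different node and the scatter/gather step would go over the network, which is the part a plain embedded library does not give you.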

This is why I went ahead and created ElasticSearch. It has a very advanced distributed model, speaks JSON natively, and exposes many advanced search features, all seamlessly expressed through JSON DSL.
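To show what "expressed through a JSON DSL" looks like from client code, here is a small Python sketch that builds a search request body as plain data. The `match` query and the `size` parameter exist in current Elasticsearch releases; the DSL available when this answer was written may have looked different. The helper name `build_search_body` and the `title` field are hypothetical.

```python
import json
from typing import Any, Dict


def build_search_body(field: str, text: str, size: int = 10) -> Dict[str, Any]:
    """Build a full-text query as a plain dict, ready to be serialized to JSON."""
    return {
        "query": {"match": {field: text}},
        "size": size,
    }


body = build_search_body("title", "distributed search")

# The serialized payload is what a client would send over HTTP to a search endpoint.
payload = json.dumps(body)
```

Because the query is just JSON, any language with an HTTP client and a JSON encoder can build it, which is the portability argument made above.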

Solr is also a solution for exposing an indexing/search server over HTTP, but I would argue that ElasticSearch provides a much superior distributed model and ease of use (it currently lacks some of the search features, but not for long, and in any case the plan is to get all Compass features into ElasticSearch). Of course, I am biased, since I created ElasticSearch, so you might need to check for yourself.

As for Sphinx, I have not used it, so I can't comment. What I can do is refer you to this thread on the Sphinx forum, which I think proves the superior distributed model of ElasticSearch.

Of course, ElasticSearch has many more features than just being distributed. It is actually built with the cloud in mind. You can check the feature list on the site.

寄人书 2024-08-28 05:57:11

I have used Sphinx, Solr and Elasticsearch. Solr and Elasticsearch are built on top of Lucene. They add many common features: a web server API, faceting, caching, etc.
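In case "faceting" is unfamiliar: it means counting how many search hits fall under each value of a field, so the UI can offer filters such as "books (2)". The Python sketch below shows only the idea; it is not the Solr or Elasticsearch API, and the `facet_counts` helper and the sample hits are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def facet_counts(hits: Iterable[Dict[str, str]], field: str) -> List[Tuple[str, int]]:
    """Count hits per value of `field`, most frequent first, ties broken by name."""
    counts = Counter(hit[field] for hit in hits if field in hit)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


hits = [
    {"title": "Lucene in Action", "category": "books"},
    {"title": "Solr tutorial", "category": "articles"},
    {"title": "Search engines explained", "category": "books"},
    {"title": "Untitled note"},  # no category: ignored by the facet
]

category_facet = facet_counts(hits, "category")
```

A search server computes these counts over the whole result set on the server side, which is hard to do efficiently with bare SQL `LIKE` queries.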

If you want to just have a simple full text search setup, Sphinx is a better choice.

If you want to customize your search at all, Elasticsearch and Solr are the better choices. They are very extensible: you can write your own plugins to adjust result scoring.

Some example usages:

  • Sphinx: craigslist.org
  • Solr: Cnet, Netflix, digg.com
  • Elasticsearch: Foursquare, Github

天邊彩虹 2024-08-28 05:57:11

We use Lucene regularly to index and search tens of millions of documents. Searches are quick enough, and we use incremental updates that do not take a long time. It did take us some time to get here. The strong points of Lucene are its scalability, a large range of features and an active community of developers. Using bare Lucene requires programming in Java.
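The point about incremental updates can be pictured with a toy inverted index in Python. This is only a sketch of the concept: Lucene itself works differently internally (it writes index segments and merges them), and the `InvertedIndex` class and its method names are hypothetical. What it shows is that changing one document only touches that document's postings instead of rebuilding the whole index.

```python
from collections import defaultdict
from typing import Dict, List, Set


class InvertedIndex:
    """Toy term -> document-id index that supports in-place updates."""

    def __init__(self) -> None:
        self.postings: Dict[str, Set[int]] = defaultdict(set)
        self.doc_terms: Dict[int, Set[str]] = {}

    def upsert(self, doc_id: int, text: str) -> None:
        """Add a new document or replace an existing one."""
        self.delete(doc_id)
        terms = set(text.lower().split())
        self.doc_terms[doc_id] = terms
        for term in terms:
            self.postings[term].add(doc_id)

    def delete(self, doc_id: int) -> None:
        """Remove a document's postings; a no-op if the id is unknown."""
        for term in self.doc_terms.pop(doc_id, set()):
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]

    def search(self, term: str) -> List[int]:
        return sorted(self.postings.get(term.lower(), set()))


index = InvertedIndex()
index.upsert(1, "fast search")
index.upsert(2, "slow query")
index.upsert(2, "fast query")  # incremental update of document 2 only
```

After the last call, document 2 is found under "fast" and no longer under "slow", while document 1 was never touched.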

If you are starting afresh, the tool for you in the Lucene family is Solr, which is much easier to set up than bare Lucene, and has almost all of Lucene's power. It can import database documents easily. Solr is written in Java, so any modification of Solr requires Java knowledge, but you can do a lot just by tweaking configuration files.

I have also heard good things about Sphinx, especially in conjunction with a MySQL database. Have not used it, though.

IMO, you should choose according to:

  • The required functionality - e.g. do you need a French stemmer? Lucene and Solr have one, I do not know about the others.
  • Proficiency in the implementation language - Do not touch Java Lucene if you do not know Java. You may need C++ to do stuff with Sphinx. Lucene has also been ported into other languages. This is mostly important if you want to extend the search engine.
  • Ease of experimentation - I believe Solr is best in this aspect.
  • Interfacing with other software - Sphinx has a good interface with MySQL. Solr supports ruby, XML and JSON interfaces as a RESTful server. Lucene only gives you programmatic access through Java. Compass and Hibernate Search are wrappers of Lucene that integrate it into larger frameworks.
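As an illustration of the "interfacing" point, here is a Python sketch that builds a request URL for Solr's HTTP interface asking for a JSON response. The `q`, `rows` and `wt` parameters and the `/select` handler are standard Solr; the host, port and the core name `mycore` are assumptions about a local test setup, and the helper name is hypothetical. The sketch only builds the URL and does not contact a server.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def solr_select_url(base_url: str, query: str, rows: int = 10) -> str:
    """Build a Solr select URL that requests JSON output."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{base_url.rstrip('/')}/select?{urlencode(params)}"


# Assumed local Solr instance with a core named "mycore".
url = solr_select_url("http://localhost:8983/solr/mycore", "title:lucene")

# Decode the query string again to show what the server would receive.
sent_params = parse_qs(urlsplit(url).query)
```

Since the whole exchange is an HTTP GET returning JSON, the client can be written in any language, whereas bare Lucene has to be called from Java.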

悟红尘 2024-08-28 05:57:11

We use Sphinx in a vertical search project with 10,000,000+ MySQL records and 10+ different databases.
It has excellent support for MySQL and high indexing performance. Searching is fast, but maybe a little slower than Lucene.
However, it's the right choice if you need to index quickly every day and use a MySQL database.

桃扇骨 2024-08-28 05:57:11

My sphinx.conf

source post_source 
{
    type = mysql

    sql_host = localhost
    sql_user = ***
    sql_pass = ***
    sql_db =   ***
    sql_port = 3306

    sql_query_pre = SET NAMES utf8
    # query before fetching rows to index

    sql_query = SELECT *, id AS pid, CRC32(safetag) as safetag_crc32 FROM hb_posts


    sql_attr_uint = pid  
    # pid (as 'sql_attr_uint') is necessary for sphinx
    # this field must be unique

    # that is why I like sphinx
    # you can store custom string fields into indexes (memory) as well
    sql_field_string = title
    sql_field_string = slug
    sql_field_string = content
    sql_field_string = tags

    sql_attr_uint = category
    # integer fields must be defined as sql_attr_uint

    sql_attr_timestamp = date
    # timestamp fields must be defined as sql_attr_timestamp

    sql_query_info_pre = SET NAMES utf8
    # if you need unicode support for sql_field_string, you need to patch the source
    # this param. is not supported natively

    sql_query_info = SELECT * FROM my_posts WHERE id = $id
}

index posts 
{
    source = post_source
    # source above

    path = /var/data/posts
    # index location

    charset_type = utf-8
}

Test script:

<?php

    require "sphinxapi.php";

    $safetag = $_GET["my_post_slug"];
//  $safetag = preg_replace("/[^a-z0-9\-_]/i", "", $safetag);

    $conf = getMyConf();

    $cl = New SphinxClient();

    $cl->SetServer($conf["server"], $conf["port"]);
    $cl->SetConnectTimeout($conf["timeout"]);
    $cl->setMaxQueryTime($conf["max"]);

    # set search params
    $cl->SetMatchMode(SPH_MATCH_FULLSCAN);
    $cl->SetArrayResult(TRUE);

    $cl->setLimits(0, 1, 1); 
    # looking for the post (not searching a keyword)

    $cl->SetFilter("safetag_crc32", array(crc32($safetag)));

    # fetch results
    $post = $cl->Query(null, "posts"); # "posts" is the index defined in sphinx.conf above

    echo "<pre>";
    var_dump($post);
    echo "</pre>";
    exit("done");
?>
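The trick used in the config and script above is to store a CRC32 of the slug as an integer attribute and filter on that integer instead of matching a string. The Python sketch below reproduces the idea without Sphinx; the `posts` data, `slug_crc32` and `find_post` are hypothetical. Python's `zlib.crc32`, PHP's `crc32()` and MySQL's `CRC32()` should all yield the same standard CRC-32 value for the same bytes, but that is worth verifying on your own data. Since different slugs can collide on the same checksum, the sketch compares the slug itself after the integer lookup.

```python
import zlib
from typing import Dict, List, Optional


def slug_crc32(slug: str) -> int:
    """Unsigned 32-bit CRC of the slug (Python 3 returns a non-negative int)."""
    return zlib.crc32(slug.encode("utf-8"))


# Stand-in for the indexed rows; in the answer these come from MySQL via Sphinx.
posts: List[Dict[str, object]] = [
    {"id": 1, "slug": "hello-world", "title": "Hello world"},
    {"id": 2, "slug": "search-engines", "title": "Search engines compared"},
]

# Integer attribute -> candidate rows, mimicking a filter on safetag_crc32.
by_crc: Dict[int, List[Dict[str, object]]] = {}
for post in posts:
    by_crc.setdefault(slug_crc32(str(post["slug"])), []).append(post)


def find_post(slug: str) -> Optional[Dict[str, object]]:
    """Look up by checksum first, then confirm the slug to rule out collisions."""
    for candidate in by_crc.get(slug_crc32(slug), []):
        if candidate["slug"] == slug:
            return candidate
    return None
```

The PHP script above skips the final string comparison, so with a large enough table it could in principle return the wrong post for a colliding slug.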

Sample result:

[array] => 
  "id" => 123,
  "title" => "My post title.",
  "content" => "My <p>post</p> content.",
   ...
   [ and other fields ]

Sphinx query time:

0.001 sec.

Sphinx query time (1k concurrent):

=> 0.346 sec. (average)
=> 0.340 sec. (average of last 10 queries)

MySQL query time:

"SELECT * FROM hb_posts WHERE id = 123;"
=> 0.001 sec.

MySQL query time (1k concurrent):

"SELECT * FROM my_posts WHERE id = 123;" 
=> 1.612 sec. (average)
=> 1.920 sec. (average of last 10 queries)
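The answer does not say how the concurrent timings were taken. One possible way to collect "average" and "average of last 10" figures is sketched below in Python, under the assumption that each query is a blocking call issued from a thread pool. `fake_query` is a placeholder that just sleeps; it stands in for a real Sphinx or MySQL call, and the pool sizes are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Callable, List


def timed(query: Callable[[], object]) -> float:
    """Run one query and return its wall-clock latency in seconds."""
    start = time.perf_counter()
    query()
    return time.perf_counter() - start


def run_concurrent(query: Callable[[], object], total: int, workers: int) -> List[float]:
    """Issue `total` queries using up to `workers` threads; return latencies in submit order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(timed, query) for _ in range(total)]
        return [f.result() for f in futures]


def fake_query() -> None:
    # Placeholder for a real search or SQL call.
    time.sleep(0.001)


latencies = run_concurrent(fake_query, total=50, workers=10)
average = mean(latencies)
average_last_10 = mean(latencies[-10:])
```

Numbers from such a harness depend heavily on the client machine and the pool size, so they are best compared only between runs made on the same setup.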