减少 Sphinx 搜索时间的想法

发布于 2024-12-27 01:22:56 字数 5381 浏览 5 评论 0原文

我正在使用思考型 sphinx gem,我的查询大约需要 45 秒才能完成(1300 万条记录,包含索引的文件夹为 1.1GB)。我假设我的配置不正确(第一次使用 Sphinx 用户)。不管怎样,如果您发现任何不对劲的地方,请告诉我。这是我的配置:

define_index do
  indexes :name
  indexes :summary
  indexes :tag_list

  indexes categories.name, :as => :category_name

  has "RADIANS(lat)",  :as => :latitude,  :type => :float
  has "RADIANS(lng)",  :as => :longitude,  :type => :float

  set_property :field_weights => {
    :name           => 8,
    :summary        => 6,
    :category_name  => 5,
    :tag_list       => 3
  }
  set_property :delta => ThinkingSphinx::Deltas::ResqueDelta
  set_property :ignore_chars => %w(' -)
end

这是一个示例查询:

Location.search('Restaurant',
                :geo => [0.5837843098436726,-1.9560609568879357],
                :latitude_attr => "latitude",
                :longitude_attr => "longitude",
                :with => {"@geodist" => 0.0..4000.0},
                :include => :categories,
                :page => 1,
                :per_page => 100)

我的日志显示:

Sphinx Query (43066.3ms)  restaurant
Sphinx  Found 467 results

我将继续挖掘文档并尝试一些东西!

更新:我的development.sphinx.conf

indexer
{
}

searchd
{
    listen = 127.0.0.1:9312
    log = /project_path/log/searchd.log
    query_log = /project_path/log/searchd.query.log
    pid_file = /project_path/log/searchd.development.pid
}

source location_core_0
{
    type = pgsql
    sql_host = localhost
    sql_user = user
    sql_pass = pass
    sql_db = db_name
    sql_query_pre = UPDATE "business_entities" SET "delta" = FALSE WHERE "delta" = TRUE
    sql_query_pre = SET TIME ZONE 'UTC'
    sql_query = SELECT "business_entities"."id" * 1::INT8 + 0 AS "id" , "business_entities"."name" AS "name", "business_entities"."summary" AS "summary", "business_entities"."tag_list" AS "tag_list", "business_entities"."id" AS "sphinx_internal_id", 0 AS "sphinx_deleted", CASE COALESCE("business_entities"."type", '') WHEN 'Location' THEN 2817059741 WHEN 'Group' THEN 2885774273 WHEN 'BraintreeBusiness' THEN 28779289 WHEN 'InvoicedBusiness' THEN 1440117572 ELSE 2817059741 END AS "class_crc", COALESCE("business_entities"."type", '') AS "sphinx_internal_class", RADIANS(lat) AS "latitude", RADIANS(lng) AS "longitude" FROM "business_entities" WHERE ("business_entities"."type" = 'Location') AND ("business_entities"."id" >= $start AND "business_entities"."id" <= $end AND "business_entities"."delta" = FALSE AND "business_entities"."type" = 'Location') GROUP BY "business_entities"."id", "business_entities"."name", "business_entities"."summary", "business_entities"."tag_list", "business_entities"."id", "business_entities"."type"
    sql_query_range = SELECT COALESCE(MIN("id"), 1::bigint), COALESCE(MAX("id"), 1::bigint) FROM "business_entities" WHERE "business_entities"."delta" = FALSE
    sql_attr_uint = sphinx_internal_id
    sql_attr_uint = sphinx_deleted
    sql_attr_uint = class_crc
    sql_attr_float = latitude
    sql_attr_float = longitude
    sql_attr_string = sphinx_internal_class
    sql_query_info = SELECT * FROM "business_entities" WHERE "id" = (($id - 0) / 1)
}

index location_core
{
    source = location_core_0
    path = /project_path/db/sphinx/development/location_core
    morphology = stem_en
    charset_type = utf-8
    ignore_chars = ', -
    enable_star = 1
}

source location_delta_0 : location_core_0
{
    type = pgsql
    sql_host = localhost
    sql_user = user
    sql_pass = pass
    sql_db = db_name
    sql_query_pre = 
    sql_query_pre = SET TIME ZONE 'UTC'
    sql_query = SELECT "business_entities"."id" * 1::INT8 + 0 AS "id" , "business_entities"."name" AS "name", "business_entities"."summary" AS "summary", "business_entities"."tag_list" AS "tag_list", "business_entities"."id" AS "sphinx_internal_id", 0 AS "sphinx_deleted", CASE COALESCE("business_entities"."type", '') WHEN 'Location' THEN 2817059741 WHEN 'Group' THEN 2885774273 WHEN 'BraintreeBusiness' THEN 28779289 WHEN 'InvoicedBusiness' THEN 1440117572 ELSE 2817059741 END AS "class_crc", COALESCE("business_entities"."type", '') AS "sphinx_internal_class", RADIANS(lat) AS "latitude", RADIANS(lng) AS "longitude" FROM "business_entities" WHERE ("business_entities"."type" = 'Location') AND ("business_entities"."id" >= $start AND "business_entities"."id" <= $end AND "business_entities"."delta" = TRUE AND "business_entities"."type" = 'Location') GROUP BY "business_entities"."id", "business_entities"."name", "business_entities"."summary", "business_entities"."tag_list", "business_entities"."id", "business_entities"."type"
    sql_query_range = SELECT COALESCE(MIN("id"), 1::bigint), COALESCE(MAX("id"), 1::bigint) FROM "business_entities" WHERE "business_entities"."delta" = TRUE
    sql_attr_uint = sphinx_internal_id
    sql_attr_uint = sphinx_deleted
    sql_attr_uint = class_crc
    sql_attr_float = latitude
    sql_attr_float = longitude
    sql_attr_string = sphinx_internal_class
    sql_query_info = SELECT * FROM "business_entities" WHERE "id" = (($id - 0) / 1)
}

index location_delta : location_core
{
    source = location_delta_0
    path = /project_path/db/sphinx/development/location_delta
}

index location
{
    type = distributed
    local = location_delta
    local = location_core
}

I'm using the thinking sphinx gem my queries are taking about 45 seconds to complete (13 Million records, the folder containing the indexes is 1.1GB). I'm assuming I have something configured incorrectly (first time Sphinx user). Anyway let me know if you see anything that looks amiss. Here is my configuration:

define_index do
  indexes :name
  indexes :summary
  indexes :tag_list

  indexes categories.name, :as => :category_name

  has "RADIANS(lat)",  :as => :latitude,  :type => :float
  has "RADIANS(lng)",  :as => :longitude,  :type => :float

  set_property :field_weights => {
    :name           => 8,
    :summary        => 6,
    :category_name  => 5,
    :tag_list       => 3
  }
  set_property :delta => ThinkingSphinx::Deltas::ResqueDelta
  set_property :ignore_chars => %w(' -)
end

Here is an example query:

Location.search('Restaurant',
                :geo => [0.5837843098436726,-1.9560609568879357],
                :latitude_attr => "latitude",
                :longitude_attr => "longitude",
                :with => {"@geodist" => 0.0..4000.0},
                :include => :categories,
                :page => 1,
                :per_page => 100)

My Log shows:

Sphinx Query (43066.3ms)  restaurant
Sphinx  Found 467 results

I'll keep digging through the docs and trying stuff!

UPDATE: my development.sphinx.conf

indexer
{
}

searchd
{
    listen = 127.0.0.1:9312
    log = /project_path/log/searchd.log
    query_log = /project_path/log/searchd.query.log
    pid_file = /project_path/log/searchd.development.pid
}

source location_core_0
{
    type = pgsql
    sql_host = localhost
    sql_user = user
    sql_pass = pass
    sql_db = db_name
    sql_query_pre = UPDATE "business_entities" SET "delta" = FALSE WHERE "delta" = TRUE
    sql_query_pre = SET TIME ZONE 'UTC'
    sql_query = SELECT "business_entities"."id" * 1::INT8 + 0 AS "id" , "business_entities"."name" AS "name", "business_entities"."summary" AS "summary", "business_entities"."tag_list" AS "tag_list", "business_entities"."id" AS "sphinx_internal_id", 0 AS "sphinx_deleted", CASE COALESCE("business_entities"."type", '') WHEN 'Location' THEN 2817059741 WHEN 'Group' THEN 2885774273 WHEN 'BraintreeBusiness' THEN 28779289 WHEN 'InvoicedBusiness' THEN 1440117572 ELSE 2817059741 END AS "class_crc", COALESCE("business_entities"."type", '') AS "sphinx_internal_class", RADIANS(lat) AS "latitude", RADIANS(lng) AS "longitude" FROM "business_entities" WHERE ("business_entities"."type" = 'Location') AND ("business_entities"."id" >= $start AND "business_entities"."id" <= $end AND "business_entities"."delta" = FALSE AND "business_entities"."type" = 'Location') GROUP BY "business_entities"."id", "business_entities"."name", "business_entities"."summary", "business_entities"."tag_list", "business_entities"."id", "business_entities"."type"
    sql_query_range = SELECT COALESCE(MIN("id"), 1::bigint), COALESCE(MAX("id"), 1::bigint) FROM "business_entities" WHERE "business_entities"."delta" = FALSE
    sql_attr_uint = sphinx_internal_id
    sql_attr_uint = sphinx_deleted
    sql_attr_uint = class_crc
    sql_attr_float = latitude
    sql_attr_float = longitude
    sql_attr_string = sphinx_internal_class
    sql_query_info = SELECT * FROM "business_entities" WHERE "id" = (($id - 0) / 1)
}

index location_core
{
    source = location_core_0
    path = /project_path/db/sphinx/development/location_core
    morphology = stem_en
    charset_type = utf-8
    ignore_chars = ', -
    enable_star = 1
}

source location_delta_0 : location_core_0
{
    type = pgsql
    sql_host = localhost
    sql_user = user
    sql_pass = pass
    sql_db = db_name
    sql_query_pre = 
    sql_query_pre = SET TIME ZONE 'UTC'
    sql_query = SELECT "business_entities"."id" * 1::INT8 + 0 AS "id" , "business_entities"."name" AS "name", "business_entities"."summary" AS "summary", "business_entities"."tag_list" AS "tag_list", "business_entities"."id" AS "sphinx_internal_id", 0 AS "sphinx_deleted", CASE COALESCE("business_entities"."type", '') WHEN 'Location' THEN 2817059741 WHEN 'Group' THEN 2885774273 WHEN 'BraintreeBusiness' THEN 28779289 WHEN 'InvoicedBusiness' THEN 1440117572 ELSE 2817059741 END AS "class_crc", COALESCE("business_entities"."type", '') AS "sphinx_internal_class", RADIANS(lat) AS "latitude", RADIANS(lng) AS "longitude" FROM "business_entities" WHERE ("business_entities"."type" = 'Location') AND ("business_entities"."id" >= $start AND "business_entities"."id" <= $end AND "business_entities"."delta" = TRUE AND "business_entities"."type" = 'Location') GROUP BY "business_entities"."id", "business_entities"."name", "business_entities"."summary", "business_entities"."tag_list", "business_entities"."id", "business_entities"."type"
    sql_query_range = SELECT COALESCE(MIN("id"), 1::bigint), COALESCE(MAX("id"), 1::bigint) FROM "business_entities" WHERE "business_entities"."delta" = TRUE
    sql_attr_uint = sphinx_internal_id
    sql_attr_uint = sphinx_deleted
    sql_attr_uint = class_crc
    sql_attr_float = latitude
    sql_attr_float = longitude
    sql_attr_string = sphinx_internal_class
    sql_query_info = SELECT * FROM "business_entities" WHERE "id" = (($id - 0) / 1)
}

index location_delta : location_core
{
    source = location_delta_0
    path = /project_path/db/sphinx/development/location_delta
}

index location
{
    type = distributed
    local = location_delta
    local = location_core
}

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

有深☉意 2025-01-03 01:22:56

我不知道为什么它的搜索运行速度如此之慢,但我首先会简化查询中的内容,然后一点一点地增加复杂性,看看是否有任何特定的原因。所以,首先:

Location.search('Restaurant')

然后也许:

Location.search('Restaurant', :per_page => 100)

等等。不要忘记索引定义中的 :field_weights 也会产生影响。

话虽如此,我并没有发现你所做的事情有什么特别奇怪的地方,而且 43 秒的搜索时间(或任何接近的时间)是我以前从未遇到过的。

I don't know precisely why it's running so slow for a search, but I'd start by simplifying things in your query, and adding back complexity bit by bit, to see if anything in particular is the cause. So, first:

Location.search('Restaurant')

Then perhaps:

Location.search('Restaurant', :per_page => 100)

And so on. Don't forget that :field_weights in your index definition will have an effect as well.

All that said, I'm not spotting anything particularly odd with what you're doing, and 43 seconds for a search (or anything close to that) is something I've not come across before.

等风来 2025-01-03 01:22:56

我发现了我的问题 - 记录恰好位于 STI 表中,但我只想对位置类型的记录进行索引(位置没有任何后代)。在该表中的 1300 万条记录中,99.99984%(认真地)属于位置类型。 SELECT DISTINCT type FROM Business_entities 查询花费的时间太长(即使使用索引)。棘手的部分是注意到这一点,因为日志报告了持续 84 秒的 Sphinx 查询,但问题实际上是掠夺性 SQL 查询:

SQL (43647.1ms)  SELECT DISTINCT type FROM business_entities
SQL (39857.7ms)  SELECT DISTINCT type FROM business_entities

Sphinx Query (84173.0ms)  restaurant

所以我在初始化程序中修补了 Thinking Sphinx 以返回我唯一关心的类型:

module ThinkingSphinx
  class Source
    module SQL
      def type_values
        ['Location']
      end
    end
  end
end

https://gist.github.com/1603565

I found my problem - The records happen to be in an STI table but I only want to index those of type Location (Location doesn't have any descendants). Of the 13 million records in this table 99.99984% (seriously) of them are of Location type. The SELECT DISTINCT type FROM business_entities query was taking way too long (even with an index). The tricky part was noticing this since the log was reporting the Sphinx Query lasting 84 seconds but it was really the predatory SQL queries that were the problem:

SQL (43647.1ms)  SELECT DISTINCT type FROM business_entities
SQL (39857.7ms)  SELECT DISTINCT type FROM business_entities

Sphinx Query (84173.0ms)  restaurant

So I monkey patched Thinking Sphinx in an initializer to return the only type I care about:

module ThinkingSphinx
  class Source
    module SQL
      def type_values
        ['Location']
      end
    end
  end
end

https://gist.github.com/1603565

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文