Sphinx with UTF-8 string attributes

Posted on 2025-01-04 06:32:07

I have a Sphinx index with UTF-8 documents, in particular the names of artists. For various reasons, we have the name both as a field (indexed_name) and as an attribute (name). When I search for a document, I find it correctly, but the attribute is being returned corrupted:

mysql> select name from artist where match('@indexed_name Sánchez') limit 3;
+---------+--------+-----------------------+
| id      | weight | name                  |
+---------+--------+-----------------------+
| 7843884 |   2642 | Sarita Sánchez     |
| 8519538 |   2642 | Cristhian  Sánchez |
| 3853986 |   2627 | Alfonso  Sánchez   |
+---------+--------+-----------------------+
3 rows in set (0.02 sec)

It looks like the attributes were originally UTF-8 but were treated as ISO-8859-1 and then converted back to UTF-8. When I run the same query from Ruby, the string appears to go through that conversion a second time:

[1] pry(main)> rs = Thebes::Sphinxql::Query.run("select name from artist where match('@indexed_name Sánchez')")
=> #<Mysql2::Result:0x000000029bebf8 (omitted...)
[2] pry(main)> name = rs.first['name']
=> "Sarita SÃ\u0083Â¡nchez"
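
For illustration, the suspected corruption can be reproduced in plain Ruby. This is only a sketch under the assumption stated above (UTF-8 bytes misread as ISO-8859-1 and re-encoded as UTF-8); `misread_as_latin1` is a hypothetical helper name, not part of Sphinx, Thinking Sphinx, or Thebes.

```ruby
# Simulate one round of "UTF-8 bytes misread as ISO-8859-1, then
# converted to UTF-8". Hypothetical helper, for illustration only.
def misread_as_latin1(text)
  # dup so the caller's string is not relabelled in place
  text.dup.force_encoding("ISO-8859-1").encode("UTF-8")
end

original = "S\u00E1nchez"            # correctly encoded name (a with acute)
once     = misread_as_latin1(original)
twice    = misread_as_latin1(once)

p once.bytes.map { |b| format("%02X", b) }   # bytes after one bad round
p twice.bytes.map { |b| format("%02X", b) }  # bytes after two bad rounds
```

One round turns the two-byte sequence for the accented letter into four bytes, which matches what the mysql client shows; a second round would give the longer form seen in the Ruby session.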

Is this a bug in Sphinx, or am I doing something wrong?

I can reverse it by cycling it through ISO-8859-1 and UTF-8:

[4] pry(main)> name.encode!("ISO-8859-1")
=> "Sarita S\xC3\x83\xC2\xA1nchez"
[5] pry(main)> name.force_encoding("UTF-8")
=> "Sarita Sánchez"
[6] pry(main)> name.encode!("ISO-8859-1")
=> "Sarita S\xC3\xA1nchez"
[7] pry(main)> name.force_encoding("UTF-8")
=> "Sarita Sánchez"

Is that going to work, though, for characters in other ISO-8859-* character sets and for things that legitimately need Unicode?

Update 1:

The answer to the second question is no. Searching for Turkish names:

mysql> select name from artist where match('@indexed_name ÖZDEMİR') limit 3;

+---------+--------+-------------------------------+
| id      | weight | name                          |
+---------+--------+-------------------------------+
| 1753230 |   2664 | Nurullah Alper Ã–ZDEMÄ°R      |
| 6973956 |   2664 | YÄ°ÄžÄ°T Ã–ZDEMÄ°R            |
| 9133770 |   2664 | TAHA Ã–ZDEMÄ°R                |
+---------+--------+-------------------------------+
3 rows in set (0.01 sec)

The second one there is supposed to be "YİĞİT ÖZDEMİR."

[2] pry(main)> rs = Thebes::Sphinxql::Query.run("select name from artist where match('@indexed_name ÖZDEMİR') limit 3")
=> #<Mysql2::Result:0x000000047779b0...
[5] pry(main)> name = rs.to_a[1]['name'].dup
=> "YÃ\u0084Â°Ã\u0084Å¾Ã\u0084Â°T Ã\u0083â\u0080\u0093ZDEMÃ\u0084Â°R"
[6] pry(main)> name.encode!("ISO-8859-1")
=> "Y\xC3\x84\xC2\xB0\xC3\x84\xC5\xBE\xC3\x84\xC2\xB0T \xC3\x83\xE2\x80\x93ZDEM\xC3\x84\xC2\xB0R"
[7] pry(main)> name.force_encoding("UTF-8")
=> "YÄ°ÄžÄ°T Ã–ZDEMÄ°R"
[8] pry(main)> name.encode!("ISO-8859-1")
Encoding::UndefinedConversionError: U+017E from UTF-8 to ISO-8859-1
from (pry):8:in `encode!'

I'm not sure how Ö got turned into Ã–, which appears to be five bytes wide...
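
One possible reading of the byte dump above: U+017E (ž) and the bytes \xE2\x80\x93 (an en dash, U+2013) are the characters that Windows-1252, not ISO-8859-1, assigns to bytes 0x9E and 0x96. If the inner layer of corruption really did go through Windows-1252, encoding to that charset instead should undo it. The sketch below rests on that assumption; `undo_cp1252_layer` is a hypothetical helper name, and the input is written with \u escapes so the source file stays ASCII.

```ruby
# Undo one layer of "UTF-8 bytes misread as Windows-1252" mojibake.
# Hypothetical helper; assumes the bad step used Windows-1252.
def undo_cp1252_layer(text)
  # Map each character back to its single Windows-1252 byte, then
  # reinterpret the resulting byte sequence as UTF-8.
  text.encode("Windows-1252").force_encoding("UTF-8")
end

# The string left after the first ISO-8859-1 round trip in the session above.
double_encoded = "Y\u00C4\u00B0\u00C4\u017E\u00C4\u00B0T \u00C3\u2013ZDEM\u00C4\u00B0R"

fixed = undo_cp1252_layer(double_encoded)
puts fixed
puts fixed.valid_encoding?
```

Windows-1252 leaves a handful of byte values unassigned, so names whose UTF-8 bytes include one of those values may still fail to round-trip this way.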

Update 2:

I don't want to post my whole sphinx.conf, but here's the config for the index that's being used here. It's generated by Thinking Sphinx.

source artist_core_0
{
  type = mysql
  sql_host = (omitted)
  sql_user = (omitted)
  sql_pass = (omitted)
  sql_db = (omitted)
  sql_query_pre = SET NAMES utf8
  sql_query_pre = SET TIME_ZONE = '+0:00'
  sql_query = (omitted)
  sql_query_range = SELECT IFNULL(MIN(`id`), 1), IFNULL(MAX(`id`), 1) FROM `artists` 
  sql_attr_uint = sphinx_internal_id
  sql_attr_uint = sphinx_deleted
  sql_attr_uint = class_crc
  sql_attr_float = latitude
  sql_attr_float = longitude
  sql_attr_string = sphinx_internal_class
  sql_attr_string = name
  sql_attr_string = homepage
  sql_attr_string = image
  sql_attr_string = city
  sql_attr_string = state
  sql_attr_string = postal_code
  sql_attr_string = country
  sql_query_info = SELECT * FROM `artists` WHERE `id` = (($id - 0) / 6)
}

index artist_core
{
  source = artist_core_0
  path = (omitted)
  morphology = libstemmer_en, libstemmer_fr, libstemmer_tr, libstemmer_es, libstemmer_de, libstemmer_it
  charset_type = utf-8
  min_prefix_len = 3
  enable_star = 1
}

index artist
{
  type = distributed
  local = artist_core
}


Comments (1)

蓝眸 2025-01-11 06:32:07

Never mind. The data in our database was double-encoded.
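
For anyone who wants to check their own data for the same problem before reindexing, a rough heuristic is sketched below. It assumes the double encoding went through Windows-1252, which is a guess based on the byte dump in the question rather than something the answer confirms, and `double_encoded?` is a hypothetical helper name.

```ruby
# Heuristic: does this string look like UTF-8 that was misread as
# Windows-1252 and then re-encoded as UTF-8? Hypothetical helper.
def double_encoded?(text)
  return false if text.ascii_only?   # pure ASCII cannot be affected
  candidate = text.encode("Windows-1252").force_encoding("UTF-8")
  candidate.valid_encoding?          # the bytes form valid UTF-8 again
rescue Encoding::UndefinedConversionError
  false                              # some character has no Windows-1252 byte
end

puts double_encoded?("S\u00C3\u00A1nchez")   # mojibake form of the name
puts double_encoded?("S\u00E1nchez")         # correctly encoded form
```

This can report false positives for text that legitimately contains such character sequences, so it is better suited to flagging rows for review than to rewriting them automatically.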
