Indexing latin1 columns that contain utf8 data with Solr
I'm working on a PHP-based webapp that has an existing MySQL database where all text columns use latin1 encoding, but they have utf8 data in them.
This works fine for the PHP app, which uses latin1 encoding for the db connection to retrieve the data then outputs directly to the browser, telling the browser the page is utf8.
However, we are now trying to use Solr to index the database, using the MySQL JDBC connector, and we are getting gibberish words in the index.
Here is the connection string:
<dataSource
    url="jdbc:mysql://localhost/db_name?characterEncoding=latin1&amp;characterSetResults=utf8"
    user="user" password="password" />
Is there a way to get Solr to retrieve the data as latin1, and then treat it as utf8 without converting it? I've tried changing the characterEncoding and characterSetResults parameters in the JDBC connection string, to no avail.
I'm sure the right way to fix this is to convert all the text columns to utf8, but I'd prefer not to go that route (yet) because it will break the webapp.
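The gibberish in the index is consistent with double encoding: the column holds UTF-8 bytes, MySQL believes they are latin1, and characterSetResults=utf8 asks it to convert them to UTF-8 a second time. A minimal Python sketch of that effect follows. It assumes cp1252 is a close enough stand-in for MySQL's latin1, and the sample string and function name are made up for illustration.

```python
# Bytes as the PHP app stored them: UTF-8, sitting in a column declared latin1.
stored_bytes = "café".encode("utf-8")


def server_convert_latin1_to_utf8(column_bytes: bytes) -> bytes:
    """Approximate MySQL converting a latin1 column for a utf8 result set.

    Assumption: cp1252 stands in for MySQL's latin1 (they differ only for a
    handful of otherwise undefined bytes).
    """
    return column_bytes.decode("cp1252").encode("utf-8")


wire_bytes = server_convert_latin1_to_utf8(stored_bytes)
indexed_text = wire_bytes.decode("utf-8")  # what a UTF-8 client ends up with

print(stored_bytes.hex(" "))  # original bytes
print(wire_bytes.hex(" "))    # bytes after the unwanted conversion
print(indexed_text)           # cafÃ©
```

The PHP app never sees this because its connection is latin1 as well, so the server passes the stored bytes through untouched and the browser interprets them as UTF-8.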
Comments (1)
You're correct: the right way is to fix your database. Trust me, I've just gone through this (to fix a Solr installation, oddly enough), and the sooner you fix it, the less pain it will cause you.
There are two ways to fix data stored in the wrong encoding:

1. mysqldump all the data as latin1 (--default-character-set=latin1), edit the table definitions to be CHARACTER SET=utf8 instead of latin1, and then reimport the dump.
2. Run two ALTER statements for each mis-encoded column:

   ALTER TABLE table_name MODIFY column_name VARBINARY(255);
   ALTER TABLE table_name MODIFY column_name VARCHAR(255) CHARSET utf8;

   By modifying to a binary type first, you can trick MySQL into not running the latin1=>utf8 conversion which, in your case, would double-encode and break your data. (A bare BINARY means BINARY(1) in MySQL, so use VARBINARY with the column's real length.)

If your db is large, 1 is going to be faster than 2.
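With many columns, the pair of statements in the second approach is tedious to write by hand. The sketch below generates them. The helper name, table names, and column names are hypothetical, and it uses VARBINARY for the intermediate step because a bare BINARY is a one-byte fixed-length type in MySQL. It only covers plain VARCHAR columns.

```python
def two_step_alters(table: str, column: str, length: int = 255) -> list[str]:
    """Return the two ALTER statements for one mis-encoded VARCHAR column.

    Step 1 switches the column to a binary type so MySQL forgets the latin1
    label; step 2 relabels the untouched bytes as utf8.
    """
    return [
        f"ALTER TABLE `{table}` MODIFY `{column}` VARBINARY({length});",
        f"ALTER TABLE `{table}` MODIFY `{column}` VARCHAR({length}) CHARSET utf8;",
    ]


# Hypothetical (table, column, length) triples; replace with the real schema.
columns = [("articles", "title", 255), ("articles", "author", 100)]

for table, column, length in columns:
    for statement in two_step_alters(table, column, length):
        print(statement)
```

MODIFY replaces the whole column definition, so attributes such as NOT NULL or DEFAULT would have to be restated in the generated statements. Trying the output on a copy of the database first seems prudent.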
Immediately after that, just configure your PHP application to communicate with MySQL using utf8. This can be done by sending

SET NAMES utf8

as the first query of every connection. If PHP is already sending utf8 data to MySQL, that should be all you need to do.

If, after this fantastically worded and detailed answer, you're still looking for a hack: try to configure JDBC to connect as UTF8, but see if there's a way to get it to send a pre-command before it does any work.
If you make that command:
SET NAMES latin1
then, theoretically, Solr should connect as UTF8 and process data coming from the connection as UTF8, but MySQL will consider the connection a latin1 connection and not convert any data coming from your latin1 columns.
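The reasoning behind the hack can be modelled in a few lines of Python. This is only a toy model of the server's behaviour, not JDBC or Solr code: result_bytes is a made-up function, cp1252 again approximates MySQL's latin1, and whether the JDBC driver really keeps decoding as UTF-8 after such a pre-command is exactly the part the answer calls theoretical.

```python
def result_bytes(column_bytes: bytes, column_codec: str, results_codec: str) -> bytes:
    """Toy model: the server converts only when the two charsets differ."""
    if column_codec == results_codec:
        return column_bytes  # same charset on both sides: bytes pass through
    return column_bytes.decode(column_codec).encode(results_codec)


original = "naïve café"
stored = original.encode("utf-8")  # UTF-8 bytes in a column labelled latin1

# Current setup: the server converts latin1 -> utf8, the client decodes UTF-8.
converted = result_bytes(stored, "cp1252", "utf-8").decode("utf-8")

# The hack: the server thinks results are latin1, the client still decodes UTF-8.
hacked = result_bytes(stored, "cp1252", "cp1252").decode("utf-8")

print(converted == original)  # False
print(hacked == original)     # True
```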