Using ARC2, textual data gets corrupted.
My RDF input file is in UTF-8. It gets loaded into ARC2, which uses a MySQL backend, through a LOAD <path/to/file.rdf> query. The MySQL database is in UTF-8 too, as a check with PHPMyAdmin confirms.
However, the textual data gets corrupted. After several conversion checks, the problem seems to be that the original UTF-8 file is treated as ISO-8859-1 and converted to UTF-8 a second time.
Example: "surmonté" → "surmonteÌ".
This "surmonteÌ" is actually stored as valid UTF-8 in the database.
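The suspected double encoding can be reproduced outside ARC2 and MySQL. The PHP sketch below is only an illustration: the helper names double_encode and undo_double_encode are made up for this example, and it assumes the mbstring extension is available. It also assumes that the é in the source file is stored as "e" plus a combining acute accent (bytes CC 81), which would explain why the garbled form shows a plain "e" followed by "Ì".

```php
<?php
// Illustration only: reproduces the suspected double encoding with mbstring.
// The function names below are hypothetical helpers, not part of ARC2.

// Misread UTF-8 bytes as ISO-8859-1 and encode them to UTF-8 a second time.
function double_encode(string $utf8): string
{
    return mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
}

// Reverse the extra step: map each code point below U+0100 back to one byte.
function undo_double_encode(string $garbled): string
{
    return mb_convert_encoding($garbled, 'ISO-8859-1', 'UTF-8');
}

// Assumption: "e" followed by a combining acute accent (U+0301).
$original = "surmonte\u{0301}";
$garbled  = double_encode($original);

echo bin2hex($original), "\n"; // bytes as they are in the input file
echo bin2hex($garbled), "\n";  // bytes after the extra conversion
echo undo_double_encode($garbled) === $original ? "round trip ok\n" : "mismatch\n";
```

Comparing the hex dump of a stored value with the hex dump of the same string in the input file is a quick way to confirm whether this is what happened to the data.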
Is this related to the way ARC2 opens files (digging through the code, not exhaustively but quite deep, did not show anything suspicious), or could this be a more general case with PHP and MySQL?
How can I make sure the imported data is not wrongly re-encoded but taken as the original?
Comments (1)
ARC2 uses two functions: $store->setUp(), which CREATEs the TABLEs and the DATABASE if need be; and query(LOAD…, as detailed in the question.
It turns out that the setUp() part must not be called in the same script as the load part. At least, not during the same execution. The solution I took was to make two separate scripts, one to init the database and another to load the data, but simply commenting out the init part once it is done also works. In any case, the trick is to make sure the loading does not take place right after the initialization.
This happens because the SET NAMES utf8 encoding specification upon DB connection is set only after collation detection, and MySQL does not seem to detect the collation properly if the database has just been created. I made a pull request of a fix.
As a side note, it is not efficient to use the LOAD <path/to/file.rdf> construct of the question: this will be computed as a relative web address, so the server ends up downloading the file from itself through the network. It is much more efficient to use a construct such as:
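The construct itself is not preserved in this copy of the answer. As a hedged sketch only, the PHP below assumes that ARC2 accepts an absolute file:// URI in LOAD and reads it from disk instead of over HTTP. That assumption, and the helper names build_load_query and init_or_load, are illustrative and not taken from the answer. The sketch also keeps setUp() and the load in separate executions, as described above, by checking the store's isSetUp() method.

```php
<?php
// Sketch under assumptions: the helper names are hypothetical, and file://
// support in LOAD is assumed rather than confirmed by the answer.

// Build a LOAD query that points at an absolute local path.
function build_load_query(string $path): string
{
    $absolute = realpath($path);
    if ($absolute === false) {
        throw new InvalidArgumentException("No such file: $path");
    }
    // On Unix-like systems realpath() starts with "/", giving file:///...
    // Paths containing spaces or other special characters would need escaping.
    return 'LOAD <file://' . $absolute . '>';
}

// Run either the initialization or the load, never both in one execution.
// $store is expected to behave like an ARC2 store (isSetUp, setUp, query).
function init_or_load($store, string $path): string
{
    if (!$store->isSetUp()) {
        $store->setUp();
        return 'initialized'; // run the script again to load the data
    }
    $store->query(build_load_query($path));
    return 'loaded';
}

// Usage, assuming ARC2 is installed and $config holds the database settings:
// include_once 'path/to/arc/ARC2.php';
// $store = ARC2::getStore($config);
// echo init_or_load($store, 'path/to/file.rdf'), "\n";
```

Running the script once creates the tables and stops; running it a second time performs the load on a fresh connection, which is the same effect as the two separate scripts.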