Really weird characters in mysql dump: what to do?
I've got weird data mangling my data migration. These weird characters are embedded as-is in the actual mysql dump file:
北京东方å›悦大酒店<br />\n<br />\n“The impetus
I've been given mysql data dumps with those kinds of chars in them. I'm importing the data into Drupal, by first recreating the mysql tables, and then querying against them using Drupal's Migrate module.
Code looks like this:
DROP TABLE IF EXISTS `news`;
SET @saved_cs_client = @@character_set_client;
SET character_set_client = utf8;
CREATE TABLE `news` (
  `id` int(11) NOT NULL auto_increment,
  `uid` int(11) NOT NULL,
  `pid` int(11) default NULL,
  `puid` int(11) default NULL,
  `headline` varchar(255) NOT NULL,
  `teaser` varchar(500) NOT NULL,
  `status` char(1) default NULL,
  `date` datetime NOT NULL,
  `url` varchar(255) default NULL,
  `url_title` varchar(255) default NULL,
  `body` text,
  `caption` varchar(255) default NULL,
  `gid` int(11) default NULL,
  `feature` text,
  `related` varchar(255) default NULL,
  `change1_time` int(11) default NULL,
  `change2_time` int(11) default NULL,
  `change1_user` varchar(255) default NULL,
  `change2_user` varchar(255) default NULL,
  `expires` datetime default NULL,
  `rank` char(1) default NULL,
  PRIMARY KEY (`id`),
  KEY `uid` (`uid`),
  KEY `status` (`status`),
  KEY `expires` (`expires`),
  KEY `rank` (`rank`),
  KEY `puid` (`puid`),
  FULLTEXT KEY `headline` (`headline`,`teaser`,`body`)
) ENGINE=MyISAM AUTO_INCREMENT=6976 DEFAULT CHARSET=utf8;
SET character_set_client = @saved_cs_client;
Fastest solution is the winner here -- I'm on a tight deadline, and really suffering over here! I've tried a search and replace solution, but there appear to be too many different types of weird data. I can orchestrate a new data dump, if I know what to tell them (how to do the data dump).
Thanks,
John
Comments (1)
This isn't a direct answer to your question, but I played a bit with the mojibake you quoted in your post. It looks like it was originally Chinese text in UTF-8 encoding, which was interpreted as Latin text in Windows-1252 encoding, re-encoded in UTF-8 and again interpreted as Windows-1252 (and finally once more encoded as UTF-8 when you posted it here). So it's not just mojibake, it's double mojibake.
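To make the chain concrete, here is a minimal Python sketch of how one Chinese character can end up double-mangled. The helper name `misread_as_cp1252` is made up for this illustration, and dropping undefined bytes with `errors="ignore"` is only an assumption about what the original tool did; other software maps such bytes to control characters instead.

```python
def misread_as_cp1252(text: str) -> str:
    """Encode text as UTF-8, then wrongly decode those bytes as Windows-1252.

    Assumption for this sketch: bytes that Python's cp1252 codec leaves
    undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) are silently dropped.
    """
    return text.encode("utf-8").decode("cp1252", errors="ignore")


original = "\u541b"                 # the character 君, UTF-8 bytes E5 90 9B
once = misread_as_cp1252(original)  # 0x90 is undefined in cp1252 and is lost
twice = misread_as_cp1252(once)     # second round: double mojibake

print(repr(once))   # 'å›'
print(repr(twice))  # 'Ã¥â€º'
```

The first round already loses a byte here, which matches the `å›` fragment visible in the quoted dump line.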
Also, at some point, a byte was lost from the middle of the string (probably because it was one of the undefined code points in Windows-1252), mangling one of the original characters. Running the text through the encoding chain in reverse (encode as Windows-1252, decode as UTF-8, repeat), I get the output:
where the replacement character � stands for the mangled character.
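The reverse chain described above (encode as Windows-1252, decode as UTF-8, repeat) can be sketched in Python roughly as follows. The function names `undo_misread` and `repair` are hypothetical, and two rounds is an assumption based on the sample in the question; a real dump may need a different number of rounds or may mix damaged and undamaged text.

```python
def undo_misread(text: str) -> str:
    """Reverse one round of 'UTF-8 bytes read as Windows-1252'.

    Encoding is strict on purpose: a UnicodeEncodeError means the text holds
    characters outside Windows-1252, so it was probably not mangled this way.
    Byte sequences that are no longer valid UTF-8 (for example because a byte
    was lost) decode to the replacement character U+FFFD.
    """
    return text.encode("cp1252").decode("utf-8", errors="replace")


def repair(text: str, rounds: int = 2) -> str:
    """Apply undo_misread the given number of times (2 = double mojibake)."""
    for _ in range(rounds):
        text = undo_misread(text)
    return text


if __name__ == "__main__":
    # Build a doubly mangled sample the same way the damage is assumed to
    # have happened, dropping bytes that cp1252 does not define.
    mangled = "东方君悦"
    for _ in range(2):
        mangled = mangled.encode("utf-8").decode("cp1252", errors="ignore")
    print(repair(mangled))
```

Because the lost byte cannot be recovered, the damaged character comes back as U+FFFD and has to be fixed by hand or from a clean source. Getting a fresh dump with the correct connection character set is likely the more reliable route.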