Replacing garbage characters in MySQL

Posted on 2025-01-05 04:23:53

My database is in latin1 and is full of â" or '��"' sequences (depending on whether my terminal is set to latin1 or Unicode, respectively). From context, I think they should be em dashes. They appear to be causing nasty bugs when rendered (or not rendered) in IE. I'd like to find and replace them. The problem is that neither the â nor the � character is matched by replace. Running the query:

    update TABLE set COLUMN = replace(COLUMN,'��"','---');

Executes without error but doesn't do anything (0 rows changed). It's clear to me that the "question mark in the diamond" character is not being matched when I copy it in the terminal. Is there a way to find out its code and match it by that or something? The mysql console is tantalizingly close to being able to do this in one line so I'd rather not script it outside the terminal if I can avoid it.

The database is hosted on Amazon RDS, so I can't install the regexp UDF that I've seen referenced in other questions here. In the long term, I'm going to have to properly convert the whole database to utf8, but I need to fix this rendering problem right away.

EDIT:

I've isolated the bad character with hexdump; it's e2 80 (I don't think this corresponds to any Unicode character). How can I feed that to the replace function?

    update TABLE set COLUMN = replace(COLUMN, char(0xe2,0x80),'---');

does not do anything.

Comments (2)

情仇皆在手 2025-01-12 04:23:53

I figured it out. I used MySQL's built-in hex function to dump an entry that I knew was bad.

    select hex(column) from table where id=666;

Then picked out the words (those numbers sandwiched between "20"s) and discovered that my offending set of bytes was in fact x'C3A2E282AC2671756F743B'. How this corresponds to the way I saw it encoded in PHP and by my system (as e2 80) I don't know and at this point, I don't really care.
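
For readers who do care how the two views relate, the bytes can be decoded outside MySQL. The following is a minimal Python sketch, not part of the original answer; it assumes the hex string above is exactly what the column holds, and the variable names are made up for illustration. It suggests the value is the UTF-8 text `â€&quot;`, that is, an em dash that was probably misread as Windows-1252 and re-encoded, with the last character apparently turned into an HTML entity along the way (that last step is a guess).

```python
# Bytes reported by HEX() for the bad entry (taken from the answer text).
stored = bytes.fromhex("C3A2E282AC2671756F743B")

# Read as UTF-8, the column holds three visible pieces:
# "â" (C3 A2), "€" (E2 82 AC) and the literal text "&quot;".
text = stored.decode("utf-8")
print(text)  # â€&quot;

# An em dash in UTF-8 is E2 80 94. Misreading those bytes as
# Windows-1252 yields "â", "€" and a right double quotation mark.
misread = "\u2014".encode("utf-8").decode("cp1252")
print(misread[:2] == text[:2])  # True

# The raw pair E2 80 never occurs in the stored value, which would
# explain why replace(COLUMN, char(0xe2,0x80), ...) changed nothing.
print(b"\xe2\x80" in stored)  # False
```

Under these assumptions, hexdump showing e2 80 on the client side and HEX() showing C3A2E282AC on the server side are two stages of the same mis-encoding rather than a contradiction.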

To verify, before destroying any data, plug that back into MySQL:

    select x'C3A2E282AC2671756F743B';
    +---------------------------+
    | x'C3A2E282AC2671756F743B' |
    +---------------------------+
    | â€"               |
    +---------------------------+
    1 row in set (0.00 sec)

So, using the replace query like above, I was able to get rid of all the bad data at once.

For the record it was:

    update TABLE set COLUMN = replace(COLUMN, x'C3A2E282AC2671756F743B','--');

I really hope this is useful for someone. Though encoding snafus appear to be pretty common in mysql, I searched everywhere and I couldn't find an explanation for this ultimately rather simple process.

风苍溪 2025-01-12 04:23:53

Something might have submitted the em dash (U+2014) in UTF-8 encoding (E2 80 94), which you are now trying to interpret as latin-1. Those bytes are not all printable in latin-1, which has no visible characters in the range 80-9F, so you probably see Unicode replacement characters for the two trailing bytes, resulting in â�� when you display it. By comparison, in Windows-1252 the same bytes would display as â€” (â, €, and a right double quotation mark).

You may be able to use CHAR(0xE2, 0x80, 0x94) to create the search string.
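
The byte-level claim above can be checked with a short Python sketch. This is an illustration only, not something from the original answer; the variable names are hypothetical, and it says nothing about what a particular MySQL connection or terminal will actually render.

```python
em_dash = "\u2014"
utf8_bytes = em_dash.encode("utf-8")
print(" ".join(f"{b:02x}" for b in utf8_bytes))  # e2 80 94

# ISO-8859-1 maps 80 and 94 to invisible C1 control characters.
as_latin1 = utf8_bytes.decode("latin-1")
print([hex(ord(c)) for c in as_latin1])  # ['0xe2', '0x80', '0x94']

# Windows-1252 assigns printable characters there instead.
as_cp1252 = utf8_bytes.decode("cp1252")
print([hex(ord(c)) for c in as_cp1252])  # ['0xe2', '0x20ac', '0x201d']

# E2 80 alone is a truncated sequence, so strict UTF-8 decoding fails.
try:
    b"\xe2\x80".decode("utf-8")
    truncated_ok = True
except UnicodeDecodeError:
    truncated_ok = False
print(truncated_ok)  # False
```

The last check is consistent with the asker's observation that e2 80 on its own does not correspond to any Unicode character: it is only the first two bytes of a three-byte sequence.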
