纠正错误编码导致的乱码数据的最佳方法

发布于 2024-08-24 01:23:53 字数 429 浏览 2 评论 0原文

我有一组数据，其中包含乱码文本字段，因为在从一个数据库到另一个数据库的许多导入/导出过程中出现编码错误。大多数错误是由将 UTF-8 转换为 ISO-8859-1 引起的。奇怪的是，错误并不一致：单词“München”在某些地方显示为“München”在其他地方也称为“Mönchen”。

SQL Server 有什么技巧可以纠正这种错误吗？我能想到的第一件事是利用 COLLATE 子句，以便将 ü 解释为 ü，但我不完全知道知识。如果无法在数据库级别实现，您知道有什么工具可以帮助进行批量校正吗？（没有手动查找/替换工具，但有一个以某种方式猜测乱码文本并纠正它们的工具）

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

虚拟世界 2024-08-31 01:23:53

我曾经处于完全相同的位置。生产MySQL服务器设置为latin1，旧数据为latin1，新数据为utf8但存储到latin1列，然后添加utf8列......每行可以包含任意数量的编码。

最大的问题是，没有一种解决方案可以纠正所有问题，因为许多遗留编码对不同字符使用相同的字节。这意味着您将不得不诉诸启发法。在我的 Utf8Voodoo 类中，有一个从 127 到 255 的巨大字节数组，也称为传统单字节编码非 ASCII 字符。

// ISO-8859-15 has the Euro sign, but ISO-8859-1 has also been used on the
// site. Sigh. Windows-1252 has the Euro sign at 0x80 (and other printable
// characters in 0x80-0x9F), but mb_detect_encoding never returns that
// encoding when ISO-8859-* is in the detect list, so we cannot use it.
// CP850 has accented letters and currency symbols in 0x80-0x9F. It occurs
// just a few times, but enough to make it pretty much impossible to
// automagically detect exactly which non-ISO encoding was used. Hence the
// need for "likely bytes" in addition to the "magic bytes" below.

/**
 * This array contains the magic bytes that determine possible encodings.
 * It works by elimination: the most specific byte patterns (the array's
 * keys) are listed first. When a match is found, the possible encodings
 * are that entry's value.
 */
public static $legacyEncodingsMagicBytes = array(
    '/[\x81\x8D\x8F\x90\x9D]/' => array('CP850'),
    '/[\x80\x82-\x8C\x8E\x91-\x9C\x9E\x9F]/' => array('Windows-1252', 'CP850'),
    '/./' => array('ISO-8859-15', 'ISO-8859-1', 'Windows-1252', 'CP850'),
);

/**
 * This array contains the bytes that make it more likely for a string to
 * be a certain encoding. The keys are the pattern, the values are arrays
 * with (encoding => likeliness score modifier).
 */
public static $legacyEncodingsLikelyBytes = array(
    // Byte | ISO-1  | ISO-15 | W-1252 | CP850
    // 0x80 | -      | -      | €      | Ç
    '/\x80/' => array(
        'Windows-1252' => +10,
    ),
    // Byte | ISO-1  | ISO-15 | W-1252 | CP850
    // 0x93 | -      | -      | “      | ô
    // 0x94 | -      | -      | ”      | ö
    // 0x95 | -      | -      | •      | ò
    // 0x96 | -      | -      | –      | û
    // 0x97 | -      | -      | —      | ù
    // 0x99 | -      | -      | ™      | Ö
    '/[\x93-\x97\x99]/' => array(
        'Windows-1252' => +1,
    ),
    // Byte | ISO-1  | ISO-15 | W-1252 | CP850
    // 0x86 | -      | -      | †      | å
    // 0x87 | -      | -      | ‡      | ç
    // 0x89 | -      | -      | ‰      | ë
    // 0x8A | -      | -      | Š      | è
    // 0x8C | -      | -      | Œ      | î
    // 0x8E | -      | -      | Ž      | Ä
    // 0x9A | -      | -      | š      | Ü
    // 0x9C | -      | -      | œ      | £
    // 0x9E | -      | -      | ž      | ×
    '/[\x86\x87\x89\x8A\x8C\x8E\x9A\x9C\x9E]/' => array(
        'Windows-1252' => -1,
    ),
    // Byte | ISO-1  | ISO-15 | W-1252 | CP850
    // 0xA4 | ¤      | €      | ¤      | ñ
    '/\xA4/' => array(
        'ISO-8859-15' => +10,
    ),
    // Byte | ISO-1  | ISO-15 | W-1252 | CP850
    // 0xA6 | ¦      | Š      | ¦      | ª
    // 0xBD | ½      | œ      | ½      | ¢
    '/[\xA6\xBD]/' => array(
        'ISO-8859-15' => -1,
    ),
    // Byte | ISO-1  | ISO-15 | W-1252 | CP850
    // 0x82 | -      | -      | ‚      | é
    // 0xA7 | §      | §      | §      | º
    // 0xFD | ý      | ý      | ý      | ²
    '/[\x82\xA7\xCF\xFD]/' => array(
        'CP850' => +1
    ),
    // Byte | ISO-1  | ISO-15 | W-1252 | CP850
    // 0x91 | -      | -      | ‘      | æ
    // 0x92 | -      | -      | ’      | Æ
    // 0xB0 | °      | °      | °      | ░
    // 0xB1 | ±      | ±      | ±      | ▒
    // 0xB2 | ²      | ²      | ²      | ▓
    // 0xB3 | ³      | ³      | ³      | │
    // 0xB9 | ¹      | ¹      | ¹      | ╣
    // 0xBA | º      | º      | º      | ║
    // 0xBB | »      | »      | »      | ╗
    // 0xBC | ¼      | Œ      | ¼      | ╝
    // 0xC1 | Á      | Á      | Á      | ┴
    // 0xC2 | Â      | Â      | Â      | ┬
    // 0xC3 | Ã      | Ã      | Ã      | ├
    // 0xC4 | Ä      | Ä      | Ä      | ─
    // 0xC5 | Å      | Å      | Å      | ┼
    // 0xC8 | È      | È      | È      | ╚
    // 0xC9 | É      | É      | É      | ╔
    // 0xCA | Ê      | Ê      | Ê      | ╩
    // 0xCB | Ë      | Ë      | Ë      | ╦
    // 0xCC | Ì      | Ì      | Ì      | ╠
    // 0xCD | Í      | Í      | Í      | ═
    // 0xCE | Î      | Î      | Î      | ╬
    // 0xD9 | Ù      | Ù      | Ù      | ┘
    // 0xDA | Ú      | Ú      | Ú      | ┌
    // 0xDB | Û      | Û      | Û      | █
    // 0xDC | Ü      | Ü      | Ü      | ▄
    // 0xDF | ß      | ß      | ß      | ▀
    // 0xE7 | ç      | ç      | ç      | þ
    // 0xE8 | è      | è      | è      | Þ
    '/[\x91\x92\xB0-\xB3\xB9-\xBC\xC1-\xC5\xC8-\xCE\xD9-\xDC\xDF\xE7\xE8]/' => array(
        'CP850' => -1
    ),
/* etc. */

然后循环遍历字符串中的字节（而不是字符）并保留分数。如果您需要更多信息，请告诉我。

I have been in exactly the same position. The production MySQL server was set up to be latin1, old data was latin1, new data was utf8 but stored to latin1 columns, then utf8 columns were added… Each row could contain any number of encodings.

The big problem is that there is no single one solution that corrects everything, because a lot of legacy encodings use the same bytes for different characters. That means you will have to resort to heuristics. In my Utf8Voodoo class, there is a huge array of the bytes from 127 to 255, a.k.a. the legacy single-byte encoding non-ASCII characters.

// ISO-8859-15 has the Euro sign, but ISO-8859-1 has also been used on the
// site. Sigh. Windows-1252 has the Euro sign at 0x80 (and other printable
// characters in 0x80-0x9F), but mb_detect_encoding never returns that
// encoding when ISO-8859-* is in the detect list, so we cannot use it.
// CP850 has accented letters and currency symbols in 0x80-0x9F. It occurs
// just a few times, but enough to make it pretty much impossible to
// automagically detect exactly which non-ISO encoding was used. Hence the
// need for "likely bytes" in addition to the "magic bytes" below.

/**
 * This array contains the magic bytes that determine possible encodings.
 * It works by elimination: the most specific byte patterns (the array's
 * keys) are listed first. When a match is found, the possible encodings
 * are that entry's value.
 */
public static $legacyEncodingsMagicBytes = array(
    '/[\x81\x8D\x8F\x90\x9D]/' => array('CP850'),
    '/[\x80\x82-\x8C\x8E\x91-\x9C\x9E\x9F]/' => array('Windows-1252', 'CP850'),
    '/./' => array('ISO-8859-15', 'ISO-8859-1', 'Windows-1252', 'CP850'),
);

/**
 * This array contains the bytes that make it more likely for a string to
 * be a certain encoding. The keys are the pattern, the values are arrays
 * with (encoding => likeliness score modifier).
 */
public static $legacyEncodingsLikelyBytes = array(
    // Byte | ISO-1  | ISO-15 | W-1252 | CP850
    // 0x80 | -      | -      | €      | Ç
    '/\x80/' => array(
        'Windows-1252' => +10,
    ),
    // Byte | ISO-1  | ISO-15 | W-1252 | CP850
    // 0x93 | -      | -      | “      | ô
    // 0x94 | -      | -      | ”      | ö
    // 0x95 | -      | -      | •      | ò
    // 0x96 | -      | -      | –      | û
    // 0x97 | -      | -      | —      | ù
    // 0x99 | -      | -      | ™      | Ö
    '/[\x93-\x97\x99]/' => array(
        'Windows-1252' => +1,
    ),
    // Byte | ISO-1  | ISO-15 | W-1252 | CP850
    // 0x86 | -      | -      | †      | å
    // 0x87 | -      | -      | ‡      | ç
    // 0x89 | -      | -      | ‰      | ë
    // 0x8A | -      | -      | Š      | è
    // 0x8C | -      | -      | Œ      | î
    // 0x8E | -      | -      | Ž      | Ä
    // 0x9A | -      | -      | š      | Ü
    // 0x9C | -      | -      | œ      | £
    // 0x9E | -      | -      | ž      | ×
    '/[\x86\x87\x89\x8A\x8C\x8E\x9A\x9C\x9E]/' => array(
        'Windows-1252' => -1,
    ),
    // Byte | ISO-1  | ISO-15 | W-1252 | CP850
    // 0xA4 | ¤      | €      | ¤      | ñ
    '/\xA4/' => array(
        'ISO-8859-15' => +10,
    ),
    // Byte | ISO-1  | ISO-15 | W-1252 | CP850
    // 0xA6 | ¦      | Š      | ¦      | ª
    // 0xBD | ½      | œ      | ½      | ¢
    '/[\xA6\xBD]/' => array(
        'ISO-8859-15' => -1,
    ),
    // Byte | ISO-1  | ISO-15 | W-1252 | CP850
    // 0x82 | -      | -      | ‚      | é
    // 0xA7 | §      | §      | §      | º
    // 0xFD | ý      | ý      | ý      | ²
    '/[\x82\xA7\xCF\xFD]/' => array(
        'CP850' => +1
    ),
    // Byte | ISO-1  | ISO-15 | W-1252 | CP850
    // 0x91 | -      | -      | ‘      | æ
    // 0x92 | -      | -      | ’      | Æ
    // 0xB0 | °      | °      | °      | ░
    // 0xB1 | ±      | ±      | ±      | ▒
    // 0xB2 | ²      | ²      | ²      | ▓
    // 0xB3 | ³      | ³      | ³      | │
    // 0xB9 | ¹      | ¹      | ¹      | ╣
    // 0xBA | º      | º      | º      | ║
    // 0xBB | »      | »      | »      | ╗
    // 0xBC | ¼      | Œ      | ¼      | ╝
    // 0xC1 | Á      | Á      | Á      | ┴
    // 0xC2 | Â      | Â      | Â      | ┬
    // 0xC3 | Ã      | Ã      | Ã      | ├
    // 0xC4 | Ä      | Ä      | Ä      | ─
    // 0xC5 | Å      | Å      | Å      | ┼
    // 0xC8 | È      | È      | È      | ╚
    // 0xC9 | É      | É      | É      | ╔
    // 0xCA | Ê      | Ê      | Ê      | ╩
    // 0xCB | Ë      | Ë      | Ë      | ╦
    // 0xCC | Ì      | Ì      | Ì      | ╠
    // 0xCD | Í      | Í      | Í      | ═
    // 0xCE | Î      | Î      | Î      | ╬
    // 0xD9 | Ù      | Ù      | Ù      | ┘
    // 0xDA | Ú      | Ú      | Ú      | ┌
    // 0xDB | Û      | Û      | Û      | █
    // 0xDC | Ü      | Ü      | Ü      | ▄
    // 0xDF | ß      | ß      | ß      | ▀
    // 0xE7 | ç      | ç      | ç      | þ
    // 0xE8 | è      | è      | è      | Þ
    '/[\x91\x92\xB0-\xB3\xB9-\xBC\xC1-\xC5\xC8-\xCE\xD9-\xDC\xDF\xE7\xE8]/' => array(
        'CP850' => -1
    ),
/* etc. */

Then you loop over the bytes (not characters) in the strings and keep the scores. Let me know if you want some more info.

回复收藏 0 原文