Best way to correct garbled data caused by encoding errors

I have a set of data that contains garbled text fields because of encoding errors during many import/exports from one database to another. Most of the errors were caused by converting UTF-8 to ISO-8859-1. Strangely enough, the errors are not consistent: the word 'München' appears as 'MÃ¼nchen' in some places and as 'MÃœnchen' in others.

Is there a trick in SQL Server to correct this kind of crap? The first thing I can think of is to exploit the COLLATE clause, so that Ã¼ is interpreted as ü, but I don't know exactly how. If it isn't possible to do it at the DB level, do you know any tool that helps with bulk correction? (Not a manual find/replace tool, but one that somehow guesses the garbled text and corrects it.)

Comments (4)

虚拟世界 2024-08-31 01:23:53

I have been in exactly the same position. The production MySQL server was set up as latin1, old data was latin1, new data was utf8 but stored in latin1 columns, then utf8 columns were added… Each row could contain any number of encodings.

The big problem is that there is no single solution that corrects everything, because a lot of legacy encodings use the same bytes for different characters. That means you will have to resort to heuristics. In my Utf8Voodoo class, there is a huge array covering the bytes from 128 to 255, i.e. the non-ASCII characters of the legacy single-byte encodings.

// ISO-8859-15 has the Euro sign, but ISO-8859-1 has also been used on the
// site. Sigh. Windows-1252 has the Euro sign at 0x80 (and other printable
// characters in 0x80-0x9F), but mb_detect_encoding never returns that
// encoding when ISO-8859-* is in the detect list, so we cannot use it.
// CP850 has accented letters and currency symbols in 0x80-0x9F. It occurs
// just a few times, but enough to make it pretty much impossible to
// automagically detect exactly which non-ISO encoding was used. Hence the
// need for "likely bytes" in addition to the "magic bytes" below.

/**
 * This array contains the magic bytes that determine possible encodings.
 * It works by elimination: the most specific byte patterns (the array's
 * keys) are listed first. When a match is found, the possible encodings
 * are that entry's value.
 */
public static $legacyEncodingsMagicBytes = array(
    '/[\x81\x8D\x8F\x90\x9D]/' => array('CP850'),
    '/[\x80\x82-\x8C\x8E\x91-\x9C\x9E\x9F]/' => array('Windows-1252', 'CP850'),
    '/./' => array('ISO-8859-15', 'ISO-8859-1', 'Windows-1252', 'CP850'),
);

/**
 * This array contains the bytes that make it more likely for a string to
 * be a certain encoding. The keys are the pattern, the values are arrays
 * with (encoding => likeliness score modifier).
 */
public static $legacyEncodingsLikelyBytes = array(
    // Byte | ISO-1  | ISO-15 | W-1252 | CP850
    // 0x80 | -      | -      | €      | Ç
    '/\x80/' => array(
        'Windows-1252' => +10,
    ),
    // Byte | ISO-1  | ISO-15 | W-1252 | CP850
    // 0x93 | -      | -      | “      | ô
    // 0x94 | -      | -      | ”      | ö
    // 0x95 | -      | -      | •      | ò
    // 0x96 | -      | -      | –      | û
    // 0x97 | -      | -      | —      | ù
    // 0x99 | -      | -      | ™      | Ö
    '/[\x93-\x97\x99]/' => array(
        'Windows-1252' => +1,
    ),
    // Byte | ISO-1  | ISO-15 | W-1252 | CP850
    // 0x86 | -      | -      | †      | å
    // 0x87 | -      | -      | ‡      | ç
    // 0x89 | -      | -      | ‰      | ë
    // 0x8A | -      | -      | Š      | è
    // 0x8C | -      | -      | Œ      | î
    // 0x8E | -      | -      | Ž      | Ä
    // 0x9A | -      | -      | š      | Ü
    // 0x9C | -      | -      | œ      | £
    // 0x9E | -      | -      | ž      | ×
    '/[\x86\x87\x89\x8A\x8C\x8E\x9A\x9C\x9E]/' => array(
        'Windows-1252' => -1,
    ),
    // Byte | ISO-1  | ISO-15 | W-1252 | CP850
    // 0xA4 | ¤      | €      | ¤      | ñ
    '/\xA4/' => array(
        'ISO-8859-15' => +10,
    ),
    // Byte | ISO-1  | ISO-15 | W-1252 | CP850
    // 0xA6 | ¦      | Š      | ¦      | ª
    // 0xBD | ½      | œ      | ½      | ¢
    '/[\xA6\xBD]/' => array(
        'ISO-8859-15' => -1,
    ),
    // Byte | ISO-1  | ISO-15 | W-1252 | CP850
    // 0x82 | -      | -      | ‚      | é
    // 0xA7 | §      | §      | §      | º
    // 0xCF | Ï      | Ï      | Ï      | ¤
    // 0xFD | ý      | ý      | ý      | ²
    '/[\x82\xA7\xCF\xFD]/' => array(
        'CP850' => +1
    ),
        'CP850' => +1
    ),
    // Byte | ISO-1  | ISO-15 | W-1252 | CP850
    // 0x91 | -      | -      | ‘      | æ
    // 0x92 | -      | -      | ’      | Æ
    // 0xB0 | °      | °      | °      | ░
    // 0xB1 | ±      | ±      | ±      | ▒
    // 0xB2 | ²      | ²      | ²      | ▓
    // 0xB3 | ³      | ³      | ³      | │
    // 0xB9 | ¹      | ¹      | ¹      | ╣
    // 0xBA | º      | º      | º      | ║
    // 0xBB | »      | »      | »      | ╗
    // 0xBC | ¼      | Œ      | ¼      | ╝
    // 0xC1 | Á      | Á      | Á      | ┴
    // 0xC2 | Â      | Â      | Â      | ┬
    // 0xC3 | Ã      | Ã      | Ã      | ├
    // 0xC4 | Ä      | Ä      | Ä      | ─
    // 0xC5 | Å      | Å      | Å      | ┼
    // 0xC8 | È      | È      | È      | ╚
    // 0xC9 | É      | É      | É      | ╔
    // 0xCA | Ê      | Ê      | Ê      | ╩
    // 0xCB | Ë      | Ë      | Ë      | ╦
    // 0xCC | Ì      | Ì      | Ì      | ╠
    // 0xCD | Í      | Í      | Í      | ═
    // 0xCE | Î      | Î      | Î      | ╬
    // 0xD9 | Ù      | Ù      | Ù      | ┘
    // 0xDA | Ú      | Ú      | Ú      | ┌
    // 0xDB | Û      | Û      | Û      | █
    // 0xDC | Ü      | Ü      | Ü      | ▄
    // 0xDF | ß      | ß      | ß      | ▀
    // 0xE7 | ç      | ç      | ç      | þ
    // 0xE8 | è      | è      | è      | Þ
    '/[\x91\x92\xB0-\xB3\xB9-\xBC\xC1-\xC5\xC8-\xCE\xD9-\xDC\xDF\xE7\xE8]/' => array(
        'CP850' => -1
    ),
    /* etc. */
);

Then you loop over the bytes (not characters) in the strings and keep the scores. Let me know if you want some more info.
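For illustration, here is a minimal sketch of that scoring loop, assuming the two arrays above live in the Utf8Voodoo class mentioned earlier; the function name guessLegacyEncoding and the exact tie-breaking are mine, not the original code:

function guessLegacyEncoding($bytes)
{
    // Narrow down the candidates using the magic bytes: the first pattern
    // that matches wins, because the most specific patterns come first.
    $candidates = array();
    foreach (Utf8Voodoo::$legacyEncodingsMagicBytes as $pattern => $encodings) {
        if (preg_match($pattern, $bytes)) {
            $candidates = $encodings;
            break;
        }
    }

    // Score the remaining candidates with the likely-bytes modifiers. The
    // patterns match single bytes (no /u modifier), so counting matches is
    // equivalent to looping over every byte and keeping a running score.
    $scores = array_fill_keys($candidates, 0);
    foreach (Utf8Voodoo::$legacyEncodingsLikelyBytes as $pattern => $modifiers) {
        $count = preg_match_all($pattern, $bytes);
        if ($count > 0) {
            foreach ($modifiers as $encoding => $modifier) {
                if (isset($scores[$encoding])) {
                    $scores[$encoding] += $modifier * $count;
                }
            }
        }
    }

    // Highest score wins.
    arsort($scores);
    return key($scores);
}

Once a winner is picked, the repair itself is an ordinary conversion, e.g. iconv(guessLegacyEncoding($bytes), 'UTF-8', $bytes).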

你又不是我 2024-08-31 01:23:53

Download iconv - you can get binaries for Win32 as well as Unix/Linux. It is a command-line tool that takes a source file and, after specifying input encodings and output encodings, will do the necessary conversion for you to STDOUT.

I find myself using this extensively to convert from UTF-16 (as output by SQL Server 2005 export files) to ASCII.
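For example (the file names here are placeholders; //TRANSLIT is a GNU iconv feature that approximates characters the target encoding lacks, instead of failing):

iconv -f UTF-16 -t ASCII//TRANSLIT export.txt > export-ascii.txt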

You can download from here: http://gnuwin32.sourceforge.net/packages/libiconv.htm

猫烠⑼条掵仅有一顆心 2024-08-31 01:23:53

Given the complexity of the data (multiple encodings on a single line/entry), I think you'll have to export/dump the data and then run processing against that.

I think the best way will end up being a sequence of manual replacements. Maybe some kind of spelling correction code could find all the errors - then you can add explicit correction code. Then iterate till the spelling check stops finding errors?

(Obviously add any correct words to the dictionary for the spelling checker).

话少心凉 2024-08-31 01:23:53

Have a look at https://github.com/LuminosoInsight/python-ftfy - it's quite good at heuristic correction and will take care of considerably more ugliness than you'd expect from looking at small samples of your data.
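As a quick illustration (a minimal Python snippet; fix_text is ftfy's main entry point):

import ftfy

print(ftfy.fix_text('MÃ¼nchen'))  # prints 'München'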
