Converting a PostgreSQL database from SQL_ASCII, containing mixed encoding types, to UTF-8

Posted 2024-09-30 11:24:01

I have a postgresql database I would like to convert to UTF-8.

The problem is that it is currently SQL_ASCII, so hasn't been doing any kind of encoding conversion on its input, and as such has ended up with data of a mix of encoding types in the tables. One row might contain values encoded as UTF-8, another might be ISO-8859-x, or Windows-125x, etc.

This has made performing a dump of the database, and converting it to UTF-8 with the intention of importing it into a fresh UTF-8 database, difficult. If the data were all of one encoding type, I could just run the dump file through iconv, but I don't think that approach works here.

Is the problem fundamentally down to knowing how each piece of data is encoded? Where that is not known, can it be worked out, or even guessed? Ideally I'd love a script which would take a file, any file, and spit out valid UTF-8.

Comments (4)

束缚m 2024-10-07 11:24:01

This is exactly the problem that Encoding::FixLatin was written to solve*.

If you install the Perl module then you'll also get the fix_latin command-line utility which you can use like this:

pg_restore -O dump_file | fix_latin | psql -d database

Read the 'Limitations' section of the documentation to understand how it works.
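For a rough idea of what it does (fix_latin itself is Perl; this is only a Python sketch of the same heuristic as described in the module's documentation, not its actual code): bytes that already form valid UTF-8 are passed through unchanged, and any remaining high bytes are assumed to be Latin-1/Windows-1252 and transcoded to UTF-8.

import re
import sys

# One token = an ASCII byte, a plausible 2-4 byte UTF-8 sequence, or a lone high byte.
_TOKEN = re.compile(
    rb'[\x00-\x7F]'                  # ASCII
    rb'|[\xC2-\xDF][\x80-\xBF]'      # 2-byte UTF-8
    rb'|[\xE0-\xEF][\x80-\xBF]{2}'   # 3-byte UTF-8
    rb'|[\xF0-\xF4][\x80-\xBF]{3}'   # 4-byte UTF-8
    rb'|[\x80-\xFF]'                 # anything else: assume Windows-1252/Latin-1
)

def fix_latin_like(data: bytes) -> str:
    out = []
    for m in _TOKEN.finditer(data):
        tok = m.group(0)
        try:
            out.append(tok.decode('utf-8'))   # ASCII or a valid UTF-8 sequence
        except UnicodeDecodeError:
            # Stray high byte (or an overlong/surrogate sequence): treat as CP1252,
            # mapping its few undefined bytes to U+FFFD rather than failing.
            out.append(tok.decode('cp1252', errors='replace'))
    return ''.join(out)

if __name__ == '__main__':
    sys.stdout.write(fix_latin_like(sys.stdin.buffer.read()))

In practice the fix_latin utility itself is the simpler choice; the sketch only illustrates the idea, and why (as the note below says) a mixture of eastern and western encodings can't be untangled this way.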

[*] Note I'm assuming that when you say ISO-8859-x you mean ISO-8859-1 and when you say CP125x you mean CP1252 - because the mix of ASCII, UTF-8, Latin-1 and WinLatin-1 is a common case. But if you really do have a mixture of eastern and western encodings then sorry but you're screwed :-(

风启觞 2024-10-07 11:24:01

It is impossible without some knowledge of the data first. Do you know if it is a text message or people's names or places? In some particular language?

You can try to encode a line of a dump and apply some heuristic — for example try an automatic spell checker and choose an encoding that generates the lowest number of errors or the most known words etc.

You can use for example aspell list -l en (en for English, pl for Polish, fr for French, etc.) to get a list of misspelled words. Then you can choose the encoding which generates the fewest of them. You'd need to install the corresponding dictionary package, for example "aspell-en" on my Fedora 13 Linux system.
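A hedged sketch of that scoring idea in Python, assuming aspell and the relevant dictionaries are installed; the candidate encoding list and the calling convention here are illustrative, not prescriptive:

import subprocess

CANDIDATES = ['utf-8', 'cp1252', 'iso-8859-1', 'iso-8859-2']

def misspelling_count(text: str, lang: str = 'en') -> int:
    # "aspell list" reads text on stdin and prints one misspelled word per line.
    result = subprocess.run(
        ['aspell', 'list', '-l', lang],
        input=text.encode('utf-8'),
        stdout=subprocess.PIPE,
        check=True,
    )
    return len(result.stdout.splitlines())

def guess_encoding(raw: bytes, lang: str = 'en') -> str:
    scores = {}
    for enc in CANDIDATES:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue  # not even decodable with this encoding
        scores[enc] = misspelling_count(text, lang)
    # Keep the candidate that produced the fewest misspelled words.
    return min(scores, key=scores.get)

# e.g. guess_encoding(suspect_line_bytes, lang='en')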

牵你的手,一向走下去 2024-10-07 11:24:01

I've seen exactly this problem myself, actually. The short answer: there's no straightforward algorithm. But there is some hope.

First, in my experience, the data tends to be:

  • 99% ASCII
  • .9% UTF-8
  • .1% other, 75% of which is Windows-1252.

So let's use that. You'll want to analyze your own dataset, to see if it follows this pattern. (I am in America, so this is typical. I imagine a DB containing data based in Europe might not be so lucky, and something further east even less so.)

First, most every encoding out there today contains ASCII as a subset. UTF-8 does, ISO-8859-1 does, etc. Thus, if a field contains only octets within the range [0, 0x7F] (ie, ASCII characters), then it's probably encoded in ASCII/UTF-8/ISO-8859-1/etc. If you're dealing with American English, this will probably take care of 99% of your data.

On to what's left.

UTF-8 has some nice properties, in that it will either be 1 byte ASCII characters, OR everything after the first byte will be 10xxxxxx in binary. So: attempt to run your remaining fields through a UTF-8 decoder (one that will choke if you give it garbage.) On the fields it doesn't choke on, my experience has been that they're probably valid UTF-8. (It is possible to get a false positive here: we could have a tricky ISO-8859-1 field that is also valid UTF-8.)

Last, if it's not ASCII, and it doesn't decode as UTF-8, Windows-1252 seems to be the next good choice to try. Almost everything is valid Windows-1252 though, so it's hard to get failures here.

You might do this:

  • Attempt to decode as ASCII. If successful, assume ASCII.
  • Attempt to decode as UTF-8.
  • Attempt to decode as Windows-1252.

For the UTF-8 and Windows-1252, output the table's PK and the "guess" decoded text to a text file (convert the Windows-1252 to UTF-8 before outputting). Have a human look over it, see if they see anything out of place. If there's not too much non-ASCII data (and like I said, ASCII tends to dominate, if you're in America...), then a human could look over the whole thing.
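A minimal Python sketch of that cascade and the review output, assuming you can iterate over (primary key, raw bytes) pairs from the suspect columns; the rows iterable and review_file below are placeholders, not part of the original answer:

def guess_decode(raw: bytes) -> tuple[str, str]:
    try:
        return raw.decode('ascii'), 'ascii'
    except UnicodeDecodeError:
        pass
    try:
        # Strict UTF-8: chokes on garbage, which is exactly what we want here.
        return raw.decode('utf-8'), 'utf-8'
    except UnicodeDecodeError:
        pass
    # Nearly any byte string "decodes" as Windows-1252, so this is the last resort;
    # replace its few undefined code points instead of erroring out.
    return raw.decode('cp1252', errors='replace'), 'cp1252'

# Write every non-ASCII guess out for a human to review, keyed by primary key.
# for pk, raw in rows:
#     text, guess = guess_decode(raw)
#     if guess != 'ascii':
#         review_file.write(f"{pk}\t{guess}\t{text}\n")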

Also, if you have some idea about what your data looks like, you could restrict decodings to certain characters. For example, if a field decodes as valid UTF-8 text, but contains a "©", and the field is a person's name, then it was probably a false positive, and should be looked at more closely.

Lastly, be aware that when you change to a UTF-8 database, whatever has been inserting this garbage data in the past is probably still there: you'll need to track down this system and teach it character encoding.

九歌凝 2024-10-07 11:24:01

I resolved it using these commands:

1-) Export

pg_dump --username=postgres --encoding=ISO88591 database -f database.sql

and after

2-) Import

psql -U postgres -d database < database.sql

These commands helped me solve the problem of converting SQL_ASCII to UTF-8.
