Correcting XML encoding

Posted on 2024-10-11 14:06:07

I have an XML file whose encoding declaration says 'utf-8', but the content is actually iso-8859-1.

How do I detect this programmatically in Perl and Python, and how do I decode the file with a different encoding?

In Perl, I tried

$xml = decode('iso-8859-1',$file)

but this does not work.

Comments (3)

败给现实 2024-10-18 14:06:07

A wrong encoding label is notoriously tricky to detect, because arbitrary binary data is often a valid string in many different encodings.

In Perl, the easiest thing to try is to decode the data as UTF-8 and check for failure. (This only works one way round: a UTF-8 encoded western-language document is almost always a valid ISO-8859-1 document as well, so the reverse test tells you nothing.)

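use Encode qw(decode decode_utf8 FB_CROAK LEAVE_SRC);
# LEAVE_SRC keeps $file intact; with a CHECK value alone, Encode may modify the source in place
my $xml = eval { decode_utf8( $file, FB_CROAK | LEAVE_SRC ) };
$xml = decode( 'iso-8859-1', $file ) if $@;    # not valid UTF-8, so probably ISO-8859-1 instead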
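Since the question also asks about Python, the same check can be sketched there. This is a minimal illustration, not a general encoding detector: `guess_encoding` is a hypothetical helper name, and it assumes the only two candidates are UTF-8 and ISO-8859-1.

```python
def guess_encoding(raw_bytes: bytes) -> str:
    """Return 'utf-8' if the bytes decode strictly as UTF-8, else 'iso-8859-1'.

    Assumption: the data is one of these two encodings and nothing else.
    """
    try:
        raw_bytes.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return "iso-8859-1"
    return "utf-8"


# Sample data made up for illustration.
latin1_bytes = "naïve café".encode("iso-8859-1")
utf8_bytes = "naïve café".encode("utf-8")

print(guess_encoding(latin1_bytes))  # the ISO-8859-1 bytes are not valid UTF-8
print(guess_encoding(utf8_bytes))

# The reverse test is useless: UTF-8 bytes decode as ISO-8859-1 without any
# error, they just come out as mojibake.
print(utf8_bytes.decode("iso-8859-1"))
```

Pure ASCII input is reported as UTF-8 here, which is harmless because ASCII bytes decode identically under both encodings.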

Once you've detected the problem, you have to work around it. How you do that will most likely depend on the parser library you're using, but some general points ought to apply.

If there's no XML declaration or MIME type, the Perl native encoding will be used, so the code you posted should do the trick.

If there's a mistaken XML declaration, you can either override it using whatever facility your XML parsing library provides, or just replace it manually before handing the data over.

# assuming it's on line 1:
$contents =~ s/.*/<?xml version="1.0" encoding="ISO-8859-1"?>/;
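Where the parser lets you override the declared encoding, no text substitution is needed. As a minimal Python sketch, the standard library's `xml.etree.ElementTree.XMLParser` accepts an `encoding` argument that overrides the encoding specified in the XML file. The sample document below is made up for illustration, and other parser libraries may offer a different facility or none at all.

```python
import xml.etree.ElementTree as ET

# Made-up sample: the declaration claims UTF-8, the bytes are ISO-8859-1.
mislabelled = (
    '<?xml version="1.0" encoding="utf-8"?><name>café</name>'
).encode("iso-8859-1")

# The encoding given to the parser takes precedence over the declaration.
parser = ET.XMLParser(encoding="iso-8859-1")
root = ET.fromstring(mislabelled, parser=parser)

print(root.text)
```

This only helps once you already know the real encoding, so it complements the UTF-8 validity check rather than replacing it.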

亣腦蒛氧 2024-10-18 14:06:07

The general procedure should be the same no matter which language you use:

Open your file and read the raw bytes into a string.

Attempt to decode the raw bytes as UTF-8, using an option that checks for errors or raises an exception if the input is not valid UTF-8.

A file of meaningful text of reasonable length that was encoded as ISO-8859-1 is very unlikely to pass this UTF-8 test (unless, of course, it is pure ASCII, which is a subset of both ISO-8859-1 and UTF-8).

If the test fails, strip off the XML declaration, if there is one, and prepend this:

<?xml version="1.0" encoding="ISO-8859-1"?>
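The whole procedure can be sketched in Python as follows. This is an illustration under stated assumptions, not a definitive implementation: `repair_declaration` is a hypothetical helper name, the declaration is assumed to sit at the very start of the data with no byte order mark, and the sample document is made up. For a real file the bytes would come from something like `open(path, "rb").read()`.

```python
import re
import xml.etree.ElementTree as ET

# An XML declaration at the very start of the data (assumed: no BOM before it).
XML_DECL = re.compile(rb"\A<\?xml\b[^>]*\?>")
LATIN1_DECL = b'<?xml version="1.0" encoding="ISO-8859-1"?>'


def repair_declaration(raw_bytes: bytes) -> bytes:
    """Return the bytes unchanged if they are valid UTF-8; otherwise replace
    any existing XML declaration with one that says ISO-8859-1."""
    try:
        raw_bytes.decode("utf-8")
        return raw_bytes
    except UnicodeDecodeError:
        body = XML_DECL.sub(b"", raw_bytes, count=1)
        return LATIN1_DECL + body


# Made-up sample: labelled UTF-8, actually ISO-8859-1.
mislabelled = (
    '<?xml version="1.0" encoding="utf-8"?>\n<name>café</name>'
).encode("iso-8859-1")

fixed = repair_declaration(mislabelled)
root = ET.fromstring(fixed)  # the parser now honours the corrected declaration
print(root.text)
```

Working on bytes rather than decoded text keeps the repair independent of the very encoding that is in doubt.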

By the way, are you sure you actually have ISO-8859-1 data and not CP1252 data (from a Windows platform)?
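One rough way to tell the two apart, sketched in Python: bytes 0x80 to 0x9F are C1 control characters in ISO-8859-1, which rarely occur in real text, whereas CP1252 uses most of that range for printable characters such as curly quotes and the euro sign. The helper name `has_c1_range_bytes` is hypothetical, the sample strings are made up, and this is a heuristic hint rather than proof.

```python
def has_c1_range_bytes(raw_bytes: bytes) -> bool:
    """True if any byte falls in 0x80-0x9F, which hints at CP1252 rather
    than ISO-8859-1 (a heuristic, not a guarantee)."""
    return any(0x80 <= b <= 0x9F for b in raw_bytes)


# Made-up samples for illustration.
windows_bytes = "“quoted” €5".encode("cp1252")
latin1_bytes = "café".encode("iso-8859-1")

print(has_c1_range_bytes(windows_bytes))
print(has_c1_range_bytes(latin1_bytes))

# Decoding the Windows bytes as CP1252 restores the original text.
print(windows_bytes.decode("cp1252"))
```

If the check fires, the declaration to prepend would name the Windows encoding (for example `windows-1252`) instead of ISO-8859-1, provided your parser supports it.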

夜无邪 2024-10-18 14:06:07

It goes without saying, of course, that finding and correcting the root cause of data corruption is always better than trying to detect and repair the corruption after the event.

Apart from that, the main point to make is that your file isn't well-formed XML as it stands, so you can't fix it using XML tools. You need to attack it at the character or binary level. As others have said, step 1 is to detect that it's not valid UTF-8; step 2 is to strip off the incorrect XML declaration and replace it with a correct one. Neither of those should be particularly difficult.
