Reading and writing XML files of unknown encoding in Perl?

Posted on 2024-11-10 07:37:18

I am picking up pieces of someone else's large project and trying to right the wrongs. The problem is, I'm just not sure what the correct ways are.

So, I am cURLing a bunch of HTML pages, then writing them to files with simple commands like:

$src = `curl http://google.com`;
open FILE, ">output.html";
print FILE $src;
close FILE;

I want those files to be saved as UTF-8. What are they actually saved as? I then read the HTML files back in using the same basic 'open' command, parse the HTML with regex calls, build one big string by concatenation, and write it to an XML file (using the same code as above). I have already started using XML::Writer instead, but now I must go through and fix the files that have inaccurate encoding.

So, I no longer have the HTML, but I still have the XML files, which have to display the proper characters. Here is an example: http://filevo.com/wkkixmebxlmh.html

The main problem is detecting and replacing the character in question with a "\x{2019}" that displays in editors properly. But I can't figure out a regex to actually capture the character in the wild.

UPDATE:

I still cannot detect the ALT-0146 character that's in the XML file I uploaded to Filevo above. I've tried opening it in UTF-8, and searching for /\x{2019}/, /chr(0x2019)/, and just /’/, nothing.

Comments (3)

許願樹丅啲祈禱 2024-11-17 07:37:18

Discovering the encoding of an HTML document is hard. See http://blog.whatwg.org/the-road-to-html-5-character-encoding and note especially that it requires a "7-step algorithm; step 4 has 2 sub-steps, the first of which has 7 branches, one of which has 8 sub-steps, one of which actually links to a separate algorithm that itself has 7 steps... It goes on like that for a while."

This is what I used for my limited needs in parsing HTML files.

my $CHARACTER_SET_CLASS = '\w:.()-';

     # X(HT)?ML: http://www.w3.org/International/O-charset
     /\<\?xml [^>]*(?<= )encoding=[\'\"]?([$CHARACTER_SET_CLASS]+)/ ||
     # X?HTML: http://blog.whatwg.org/the-road-to-html-5-character-encoding
     /\<meta [^>]*\bcharset=["']?([$CHARACTER_SET_CLASS]+)/i ||
     # CSS: http://www.w3.org/International/questions/qa-css-charset
     /\@charset "([^\"]*)"/ ||
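As a rough sketch of how the three patterns above might be used together, they could be wrapped in a small function that returns the first declared charset it finds. The name `declared_charset` is hypothetical and not part of the original answer, and this only looks for explicit declarations; it does not implement the full HTML5 sniffing algorithm.

```perl
use strict;
use warnings;

my $CHARACTER_SET_CLASS = '\w:.()-';

# The same three patterns as above, tried in order.
my @CHARSET_PATTERNS = (
    qr/<\?xml [^>]*(?<= )encoding=['"]?([$CHARACTER_SET_CLASS]+)/,  # XML declaration
    qr/<meta [^>]*\bcharset=["']?([$CHARACTER_SET_CLASS]+)/i,       # HTML meta tag
    qr/\@charset "([^"]*)"/,                                        # CSS @charset
);

# Hypothetical helper: returns the first declared charset in $text,
# or nothing (undef in scalar context) when no declaration is found.
sub declared_charset {
    my ($text) = @_;
    for my $re (@CHARSET_PATTERNS) {
        if (my ($charset) = $text =~ $re) {
            return $charset;
        }
    }
    return;
}

my $charset = declared_charset(
    '<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">'
);
print defined $charset ? "$charset\n" : "no declaration found\n";
```

A declaration can still be wrong or missing, so the result is best treated as a hint rather than a guarantee.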

泛滥成性 2024-11-17 07:37:18

To make sure you are producing output in UTF-8, apply the utf8 layer to the output stream using binmode:

open FILE, '>output.html';
binmode FILE, ':utf8';

or in the 3-argument open call:

open FILE, '>:utf8', 'output.html';
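One caveat worth spelling out: an output layer encodes Perl characters, so if the string still holds the raw bytes returned by curl, each byte is treated as a Latin-1 character and re-encoded. A minimal sketch of decoding first and then writing, assuming the source charset is already known; the helper name `save_as_utf8` is hypothetical, and it uses the stricter `:encoding(UTF-8)` layer rather than `:utf8`:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode $bytes from $charset, then write them to
# $path as UTF-8.
sub save_as_utf8 {
    my ($path, $bytes, $charset) = @_;

    # Bytes in the source encoding become Perl characters.
    my $text = decode($charset, $bytes);

    # The layer turns those characters into UTF-8 bytes on output.
    open my $fh, '>:encoding(UTF-8)', $path
        or die "Cannot open $path: $!";
    print {$fh} $text;
    close $fh or die "Cannot close $path: $!";
    return;
}
```

It could be called as `save_as_utf8('output.html', $src, 'ISO-8859-1')`, where the charset argument is whatever the page declared.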

Arbitrary input is trickier. If you are lucky, HTML input will tell you its encoding early on:

wget http://www.google.com/ -O foo ; head -1 foo

<!doctype html><html><head><meta http-equiv="content-type" content="text/html; 
charset=ISO-8859-1"><title>Google</title><script>window.google=
{kEI:"xgngTYnYIoPbgQevid3cCg",kEXPI:"23933,28505,29134,29229,29658,
29695,29795,29822,29892,30111,30174,30215,30275,30562",kCSI:
{e:"23933,28505,29134,29229,29658,29695,29795,29822,29892,30111,
30174,30215,30275,30562",ei:"xgngTYnYIoPbgQevid3cCg",expi:
"23933,28505,29134,29229,29658,29695,29795,29822,29892,30111,
30174,30215,30275,30562"},authuser:0,ml:function(){},kHL:"en",
time:function(){return(new Date).getTime()},

Ah, there it is: <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">. Now you may continue to read input as raw bytes and find some way to decode those bytes with the known encoding. CPAN can help with this.
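For the decoding step, the core Encode module is one option. The following is a sketch only: the helper name `read_decoded` is hypothetical, and the charset is assumed to be one that Encode recognizes.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# Hypothetical helper: slurp $path as raw bytes and decode them using
# the given charset name.
sub read_decoded {
    my ($path, $charset) = @_;

    # Fail early on a charset name Encode does not know.
    find_encoding($charset)
        or die "Unknown encoding: $charset";

    # :raw keeps the bytes exactly as they are on disk.
    open my $fh, '<:raw', $path
        or die "Cannot open $path: $!";
    my $bytes = do { local $/; <$fh> };
    close $fh;

    return decode($charset, $bytes);
}
```

It could be called as `my $text = read_decoded('foo', 'ISO-8859-1')` for the page fetched above.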

放肆 2024-11-17 07:37:18

I am referring to the updated part of your question (next time open a new one for a separate topic). This is a hex dump of your file (please refrain in the future from making helpers jump through burning hoops to get at your example data):

0000  3c 78 6d 6c 3e 0d 0a 3c  70 65 72 73 6f 6e 4e 61  <xml>␍␤< personNa
0010  6d 65 3e 47 2e 20 50 65  74 65 72 20 44 61 80 41  me>G. Pe ter Da�A
0020  6c 6f 69 61 3c 2f 70 65  72 73 6f 6e 4e 61 6d 65  loia</pe rsonName
0030  3e 0d 0a 3c 2f 78 6d 6c  3e 0d 0a                 >␍␤</xml >␍␤

You said you know the character should be ’, but it got totally mangled. That character is not the single byte 0x80 in any encoding (in Windows-1252, for instance, ’ is 0x92). This looks like a paste accident where you transferred data between editors/clipboards instead of dealing with just files. If that's not the case, then your cow orker produced a wrong you are not able to right algorithmically.
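To see what is really in a file before writing a regex against it, one approach is to list every run of non-ASCII bytes with its offset. This is a sketch with a hypothetical helper name, `non_ascii_runs`; it expects a byte string, for example one read through a `:raw` handle.

```perl
use strict;
use warnings;

# Hypothetical helper: return every run of non-ASCII bytes in $bytes
# as [offset, hex string] pairs.
sub non_ascii_runs {
    my ($bytes) = @_;
    my @runs;
    while ($bytes =~ /([\x80-\xFF]+)/g) {
        # $-[1] is the offset at which capture group 1 starts.
        push @runs, [ $-[1], unpack('H*', $1) ];
    }
    return @runs;
}

# Bytes 0x1c through 0x23 of the dump above.
for my $run (non_ascii_runs("Da\x80Aloia")) {
    printf "offset %d: %s\n", @$run;
}
```

For this sample it prints `offset 2: 80`. An intact UTF-8 encoding of U+2019 would show up as `e28099`, and a Windows-1252 one as `92`.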
