How do I tell Nokogiri not to convert a document to a different encoding when parsing it (in my case, not to convert £ to anything else)?
How do I tell Nokogiri not to convert a document to a different encoding, in my case not to convert £ to anything else?
I have a file containing:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
</head>
<body>
<span>£</span>
</body>
</html>
I parse it with Nokogiri:
d = Nokogiri::HTML.parse(open('/tmp/in.html', 'r'))
If I print the document "d", I get:
<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n
<html>\n
<head><meta http-equiv=\"Content-Type\" content=\"text/html; charset=iso-8859-1\"></head>\n
<body>\n
<span>\302\243</span>\n
</body>\n
</html>\n
Note: £ became "\302\243" (that is, the £ that was encoded in ISO-8859-1 is now encoded in UTF-8).
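For reference, the byte-level change described in the note can be reproduced with plain Ruby strings, without Nokogiri. This is only a sketch of the transcoding itself; the variable names are illustrative and not from the original post.

```ruby
# The pound sign as the single ISO-8859-1 byte 0xA3, as stored in the input file.
latin1 = "\xA3".dup.force_encoding(Encoding::ISO_8859_1)

# Transcoding to UTF-8 turns that one byte into two bytes: 0xC2 0xA3.
utf8 = latin1.encode(Encoding::UTF_8)
utf8_bytes = utf8.bytes

# Render the bytes as octal escapes, the notation used in the printed document.
utf8_octal = utf8_bytes.map { |b| format('\\%o', b) }.join

puts utf8_bytes.inspect
puts utf8_octal
```

The octal escapes correspond to the two UTF-8 bytes of the same character, so the character itself is unchanged; only its byte representation differs.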
If I save document "d" to a file:
open('/tmp/out.html', 'w') do |out|
  out << d.to_html
end
I get the following:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"></head>
<body>
<span>ВЈ</span>
</body>
</html>
After parsing the document containing £ and saving it to a file, I got two symbols, "ВЈ", instead.
I think I am not specifying encoding at some step, but I am not sure where.
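One possible explanation for the two symbols, sketched in plain Ruby: if the saved file holds the UTF-8 bytes of £ but is displayed as Windows-1251, those two bytes show up as two Cyrillic letters. Which encoding the viewer actually used is an assumption here; the post does not say.

```ruby
# UTF-8 bytes of the pound sign, as raw binary data: 0xC2 0xA3.
utf8_bytes = "\u00A3".encode(Encoding::UTF_8).b

# Assumption: the viewer decodes the file as Windows-1251.
# Each byte is then treated as a separate character.
misread = utf8_bytes.force_encoding(Encoding::Windows_1251).encode(Encoding::UTF_8)

puts misread
puts misread.length
```

Under that assumption, the data in the file is valid UTF-8 and the problem lies in the mismatch between the bytes written and the charset the file declares.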
Comments (1)
Look at the definition of 'parse' in the Nokogiri documentation; it takes an encoding argument.