How do I tell Nokogiri not to convert a document to a different encoding when parsing it (in my case, not to convert &pound; to anything else)?

Posted 2024-09-10 17:46:57


How do I tell Nokogiri not to convert a document to a different encoding, in my case not to convert &pound; to anything else?

I have a file containing:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
</head>
<body>
<span>&pound;</span>
</body>
</html>

I parse it with Nokogiri:

d = Nokogiri::HTML.parse(open('/tmp/in.html', 'r'))

If I print document "d" I get:

<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n
<html>\n
<head><meta http-equiv=\"Content-Type\" content=\"text/html; charset=iso-8859-1\"></head>\n
<body>\n
<span>\302\243</span>\n
</body>\n
</html>\n

Note: £ became "\302\243" (that is, the £ that was encoded in ISO-8859-1 is now encoded in UTF-8).

If I save document "d" to a file:

open('/tmp/out.html', 'w') do |out|
out << d.to_html
end

I get the following:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"></head>
<body>
<span>ВЈ</span>
</body>
</html>

After parsing the document containing &pound; and saving it to a file, I got two symbols, "ВЈ", instead of a single £.
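The byte values reported above can be reproduced with plain Ruby strings, without Nokogiri. This is only a sketch of one possible explanation: it assumes the saved file holds the UTF-8 bytes of £ and is then displayed by a viewer that reads it as Windows-1251. That assumption is not stated in the question; it is simply the single-byte code page in which these two bytes happen to render as "ВЈ".

```ruby
# "£" (U+00A3) as an ordinary UTF-8 Ruby string.
pound = "\u00A3"

# In ISO-8859-1 the character is the single byte 0xA3.
latin1_bytes = pound.encode('ISO-8859-1').bytes

# In UTF-8 it is two bytes, 0xC2 0xA3, which Ruby 1.8 prints as \302\243.
utf8_bytes = pound.bytes
octal = utf8_bytes.map { |b| format('\\%o', b) }.join

# Assumption: a viewer decodes those same two bytes as Windows-1251.
# Each byte then becomes its own Cyrillic letter.
misread = pound.b.force_encoding('Windows-1251').encode('UTF-8')

puts latin1_bytes.inspect  # [163]
puts octal                 # \302\243
puts misread               # ВЈ
```

If this is what happened, the bytes in the output file are valid UTF-8 and only the declared or assumed encoding disagrees with them.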

I think I am not specifying encoding at some step, but I am not sure where.


梦里人 2024-09-17 17:46:57

Definition of 'parse' from the Nokogiri documentation; note the encoding parameter:

# File lib/nokogiri/html.rb, line 22

def parse thing, url = nil, encoding = nil, options = XML::ParseOptions::DEFAULT_HTML, &block
  Document.parse(thing, url, encoding, options, &block)
end
