Nokogiri: leaving HTML entities untouched

Posted on 2024-12-09 19:38:04

I want Nokogiri to leave HTML entities untouched, but it seems to be converting the entities into the actual symbol. For example:

 Nokogiri::HTML.fragment('<p>&reg;</p>').to_s

results in: "<p>®</p>"

Nothing seems to return the original HTML to me.
The .inner_html, .text, and .content methods all return '®' instead of '&reg;'.

Is there a way for Nokogiri to leave these HTML entities untouched?

I've already searched stackoverflow and found similar questions, but nothing exactly like this one.

苍景流年 2024-12-16 19:38:04

Not an ideal answer, but you can force Nokogiri to emit entities (numeric ones, not the nice names) by restricting the output encoding:

#encoding: UTF-8
require 'nokogiri'
html = Nokogiri::HTML.fragment('<p>®</p>')
puts html.to_html                              #=> <p>®</p>
puts html.to_html( encoding:'US-ASCII' )       #=> <p>&#174;</p>

It would be nice if Nokogiri used the 'nice' names of entities where defined, instead of always emitting the terse numeric character reference, but even that wouldn't be 'preserving' the original.
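
If the nicer names matter, one option is to post-process the serialized string. The following is only a minimal sketch in plain Ruby: `NAMED_ENTITIES` and `prefer_named_entities` are hypothetical names invented for this illustration, not part of Nokogiri, and the table is deliberately tiny.

```ruby
# Hypothetical lookup table (code point => entity name); extend as needed.
NAMED_ENTITIES = {
  0xA9   => 'copy',
  0xAE   => 'reg',
  0x2122 => 'trade'
}.freeze

# Rewrite decimal (&#174;) and hexadecimal (&#xAE;) character references
# to named entities when the code point is in the table above.
# References that are not in the table are left exactly as they were.
def prefer_named_entities(html)
  html.gsub(/&#([xX][0-9a-fA-F]+|\d+);/) do |ref|
    digits = Regexp.last_match(1)
    codepoint =
      if digits.start_with?('x', 'X')
        digits[1..-1].to_i(16)
      else
        digits.to_i
      end
    name = NAMED_ENTITIES[codepoint]
    name ? '&' + name + ';' : ref
  end
end

puts prefer_named_entities('<p>&#174; and &#x2122;</p>')
#=> <p>&reg; and &trade;</p>
```

This still does not restore whatever spelling the input used. It only picks one consistent spelling for the output.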

The root of the problem is that, in HTML, the following all describe the exact same content:

<p>®</p>
<p>&reg;</p>
<p>&#xAE;</p>
<p>&#174;</p>

If you wanted the to_s representation of a text node to actually be &reg; then the markup describing that would really be: <p>&amp;reg;</p>.
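
To make that equivalence concrete, here is a small sketch of a single-pass reference decoder in plain Ruby. It is not how Nokogiri or libxml2 implement decoding; `ENTITY_CODEPOINTS` and `decode_references` are hypothetical names, and the table only covers the two entities used here.

```ruby
# Hypothetical, deliberately tiny table (entity name => code point).
ENTITY_CODEPOINTS = { 'reg' => 0xAE, 'amp' => 0x26 }.freeze

# Resolve named, hexadecimal and decimal character references in one pass,
# roughly what a parser does before it builds text nodes.
# Unknown named references are left untouched.
def decode_references(markup)
  markup.gsub(/&(#[xX][0-9a-fA-F]+|#\d+|[a-zA-Z]+);/) do |ref|
    body = Regexp.last_match(1)
    codepoint =
      if body.start_with?('#x', '#X')
        body[2..-1].to_i(16)
      elsif body.start_with?('#')
        body[1..-1].to_i
      else
        ENTITY_CODEPOINTS[body]
      end
    codepoint ? [codepoint].pack('U') : ref
  end
end

forms = ["<p>\u00AE</p>", '<p>&reg;</p>', '<p>&#xAE;</p>', '<p>&#174;</p>']
p forms.map { |form| decode_references(form) }.uniq.size   #=> 1
puts decode_references('<p>&amp;reg;</p>')                 #=> <p>&reg;</p>
```

All four spellings collapse to the same string once decoded, which is why the parsed document has nothing left to tell them apart. Only the escaped ampersand survives as the literal characters &reg;.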

If Nokogiri were to always return each character in the same form that was used in the input document, it would need to store each character as a custom node recording the entity reference. There is a class that might be used for this (Nokogiri::XML::EntityReference):

require 'nokogiri'
html = Nokogiri::HTML.fragment("<p>Foo</p>")
html.at('p') << Nokogiri::XML::EntityReference.new( html.document, 'reg' )
puts html
#=> <p>Foo&reg;</p>

However, I can't find a way to cause these to be created during parsing using Nokogiri v1.4.4 or v1.5.0. Specifically, the presence or absence of Nokogiri::XML::ParseOptions::NOENT during parsing does not appear to cause one to be created:

require 'nokogiri'
html = "<p>Foo&reg;</p>"
[ Nokogiri::XML::ParseOptions::NOENT,
  Nokogiri::XML::ParseOptions::DEFAULT_HTML,
  Nokogiri::XML::ParseOptions::DEFAULT_XML,
  Nokogiri::XML::ParseOptions::STRICT
].each do |parse_option|
  p Nokogiri::HTML(html,nil,'utf-8',parse_option).at('//text()')
end
#=> #<Nokogiri::XML::Text:0x810cca48 "Foo\u00AE">
#=> #<Nokogiri::XML::Text:0x810cc624 "Foo\u00AE">
#=> #<Nokogiri::XML::Text:0x810cc228 "Foo\u00AE">
#=> #<Nokogiri::XML::Text:0x810cbe04 "Foo\u00AE">
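
Since the parser always resolves references, one possible workaround (my own suggestion, not something Nokogiri provides) is to escape the ampersand of every reference before parsing and undo that after serializing. The sketch below shows only the two string helpers in plain Ruby; `protect_entities` and `restore_entities` are hypothetical names, and the Nokogiri round trip in the trailing comment is an assumed usage rather than a verified recipe.

```ruby
# Turn every character reference (&reg;, &#174;, &#xAE;, even &amp;) into
# escaped text by rewriting its leading "&" as "&amp;". After parsing, the
# text node then contains the literal characters of the original reference.
def protect_entities(html)
  html.gsub(/&(#?\w+;)/) { '&amp;' + Regexp.last_match(1) }
end

# Inverse step for the serialized output: "&amp;reg;" becomes "&reg;" again.
def restore_entities(html)
  html.gsub(/&amp;(#?\w+;)/) { '&' + Regexp.last_match(1) }
end

source = '<p>Foo&reg; &#174; &amp;reg;</p>'
protected_source = protect_entities(source)
puts protected_source
puts restore_entities(protected_source) == source

# Assumed usage with Nokogiri in between:
#   fragment = Nokogiri::HTML.fragment(protect_entities(source))
#   restore_entities(fragment.to_html)
```

The obvious limitation is that the helpers work on raw strings, so they also touch references inside attribute values, and any text added through the DOM that happens to look like a reference will be unescaped on the way out.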
