How do I pretty-print HTML with Nokogiri?

Posted on 2024-08-15 13:10:41

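I've written a web crawler in Ruby and I'm using Nokogiri::HTML to parse the pages. I need to print a page out, and while messing around in IRB I noticed a pretty_print method. However, it takes a parameter and I can't figure out what it wants.

My crawler caches the HTML of the web pages and writes it to files on my local machine. I'd like to "pretty print" the HTML so that it looks nice and is properly formatted when I do so.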

南城旧梦 2024-08-22 13:10:41

The answer by @mislav is somewhat wrong. Nokogiri does support pretty-printing if you:

Parse the document as XML
Instruct Nokogiri to ignore whitespace-only nodes ("blanks") during parsing
Use to_xhtml or to_xml to specify pretty-printing parameters

In action:

html = '<section>
<h1>Main Section 1</h1><p>Intro</p>
<section>
<h2>Subhead 1.1</h2><p>Meat</p><p>MOAR MEAT</p>
</section><section>
<h2>Subhead 1.2</h2><p>Meat</p>
</section></section>'

require 'nokogiri'
doc = Nokogiri::XML(html,&:noblanks)
puts doc
#=> <section>
#=>   <h1>Main Section 1</h1>
#=>   <p>Intro</p>
#=>   <section>
#=>     <h2>Subhead 1.1</h2>
#=>     <p>Meat</p>
#=>     <p>MOAR MEAT</p>
#=>   </section>
#=>   <section>
#=>     <h2>Subhead 1.2</h2>
#=>     <p>Meat</p>
#=>   </section>
#=> </section>

puts doc.to_xhtml( indent:3, indent_text:"." )
#=> <section>
#=> ...<h1>Main Section 1</h1>
#=> ...<p>Intro</p>
#=> ...<section>
#=> ......<h2>Subhead 1.1</h2>
#=> ......<p>Meat</p>
#=> ......<p>MOAR MEAT</p>
#=> ...</section>
#=> ...<section>
#=> ......<h2>Subhead 1.2</h2>
#=> ......<p>Meat</p>
#=> ...</section>
#=> </section>

内心激荡 2024-08-22 13:10:41

By "pretty printing" of an HTML page I presume you mean that you want to reformat the HTML structure with proper indentation. Nokogiri doesn't support this; the pretty_print method is for the "pp" library and its output is useful for debugging only.

There are several projects that understand HTML well enough to be able to reformat it without destroying whitespace that is actually significant (the famous one is HTML Tidy), but by Googling I've found this post titled "Pretty printing XHTML with Nokogiri and XSLT".

It comes down to this:

xsl = Nokogiri::XSLT(File.open("pretty_print.xsl"))
html = Nokogiri(File.open("source.html"))
puts xsl.apply_to(html).to_s

It requires you, of course, to download the linked XSL file to your filesystem. I've tried it very quickly on my machine and it works like a charm.

束缚ｍ 2024-08-22 13:10:41

This worked for me:

 pretty_html = Nokogiri::HTML(html).to_xhtml(indent: 3)

I tried the REXML version from another answer, but it corrupted some of my documents. And I hate to bring XSLT into a new project. Both feel antiquated. :)

↙温凉少女 2024-08-22 13:10:41

You can try REXML:

require "rexml/document"

doc = REXML::Document.new(xml)
doc.write($stdout, 2)
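
Since the goal is to write the result to a file rather than to the terminal, here is a sketch that collects the output in a String instead of $stdout. rexml_pretty is a hypothetical helper name, and the input has to be well-formed XML or REXML will raise a parse error.

```ruby
require "rexml/document"

# Hypothetical helper: return the indented markup as a String.
def rexml_pretty(xml, indent = 2)
  doc = REXML::Document.new(xml)
  out = +""               # unfrozen String; write appends to it with <<
  doc.write(out, indent)  # same call as above, different output target
  out
end

result = rexml_pretty("<a><b>text</b><c/></a>")
puts result
```

REXML's indenting formatter moves text content around when it lays out elements, so compare the output against the input if whitespace matters in your documents.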

旧人九事 2024-08-22 13:10:41

My solution was to add a print method onto the actual Nokogiri objects. After you run the code in the snippet below, you should just be able to write node.print, and it'll pretty print the contents. No xslt required :-)

Nokogiri::XML::Node.class_eval do
  # Print every Node by default (will be overridden by CharacterData)
  define_method :should_print? do
    true
  end

  # Duplicate this node, replace the contents of the duplicated node with a
  # newline. With this content substitution, the #to_s method conveniently
  # returns a string with the opening tag (e.g. `<a href="foo">`) on the first
  # line and the closing tag on the second (e.g. `</a>`, provided that the
  # current node is not a self-closing tag).
  #
  # Now, print the open tag preceded by the correct amount of indentation, then
  # recursively print this node's children (with extra indentation), and then
  # print the close tag (if there is a closing tag)
  define_method :print do |indent=0|
    duplicate = self.dup
    duplicate.content = "\n"
    open_tag, close_tag = duplicate.to_s.split("\n")

    puts (" " * indent) + open_tag
    self.children.select(&:should_print?).each { |child| child.print(indent + 2) }
    puts (" " * indent) + close_tag if close_tag
  end
end

Nokogiri::XML::CharacterData.class_eval do
  # Only print CharacterData if there's non-whitespace content
  define_method :should_print? do
    content =~ /\S+/
  end

  # Collapse each run of whitespace into a single space, precede the
  # output by a certain amount of indentation, and print this text.
  define_method :print do |indent=0|
    puts (" " * indent) + to_s.strip.gsub(/\s+/, ' ')
  end
end

如若梦似彩虹 2024-08-22 13:10:41

Simpler and works well

puts Nokogiri::HTML(File.read('terms.fr.html')).to_xhtml

长亭外，古道边 2024-08-22 13:10:41

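I know I'm very late to answer this question, but I'll leave an answer anyway. I tried all of the steps above, and they do work to a certain extent.

Nokogiri does format the HTML, but it doesn't care about closing or opening tags, so truly pretty formatting isn't possible.

I found a gem called htmlbeautifier that works like a charm. I hope others who are still looking for an answer will find this valuable.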

不必了 2024-08-22 13:10:41

Why don't you try the pp method?

require 'pp'
pp some_var
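
For what it's worth, this is also where the pretty_print method from the question comes in: pp calls obj.pretty_print(q) and passes in its own printer object, which is the argument the method expects. A small sketch with a made-up Page class (not part of Nokogiri) to show the protocol:

```ruby
require "pp"

# Hypothetical class, only here to illustrate the pretty_print protocol.
class Page
  def initialize(url, links)
    @url = url
    @links = links
  end

  # pp calls this and passes a PP printer object, conventionally named q.
  def pretty_print(q)
    q.group(2, "#<Page ", ">") do
      q.text @url
      q.breakable   # a space, or a line break if the line gets too long
      q.pp @links
    end
  end
end

page = Page.new("http://example.com", ["/a", "/b"])
pp page

# pretty_inspect returns the same text as a String instead of printing it.
inspected = page.pretty_inspect
```

So pp gives a readable dump of Ruby objects for debugging; it does not re-indent an HTML string.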
