How do I pretty-print HTML using Nokogiri?

Posted on 2024-08-15 13:10:41 | Words: 213 | Views: 8 | Comments: 0

I wrote a web crawler in Ruby and I'm using Nokogiri::HTML to parse the page. I need to print the page out and while messing around in IRB I noticed a pretty_print method. However it takes a parameter and I can't figure out what it wants.

My crawler is caching the HTML of the webpages and writing it to files on my local machine. I would like to "pretty print" the HTML so that it looks nice and properly formatted when I do so.

Comments (8)

南城旧梦 2024-08-22 13:10:41

The answer by @mislav is somewhat wrong. Nokogiri does support pretty-printing if you:

  • Parse the document as XML
  • Instruct Nokogiri to ignore whitespace-only nodes ("blanks") during parsing
  • Use to_xhtml or to_xml to specify pretty-printing parameters

In action:

html = '<section>
<h1>Main Section 1</h1><p>Intro</p>
<section>
<h2>Subhead 1.1</h2><p>Meat</p><p>MOAR MEAT</p>
</section><section>
<h2>Subhead 1.2</h2><p>Meat</p>
</section></section>'

require 'nokogiri'
doc = Nokogiri::XML(html,&:noblanks)
puts doc
#=> <section>
#=>   <h1>Main Section 1</h1>
#=>   <p>Intro</p>
#=>   <section>
#=>     <h2>Subhead 1.1</h2>
#=>     <p>Meat</p>
#=>     <p>MOAR MEAT</p>
#=>   </section>
#=>   <section>
#=>     <h2>Subhead 1.2</h2>
#=>     <p>Meat</p>
#=>   </section>
#=> </section>

puts doc.to_xhtml( indent:3, indent_text:"." )
#=> <section>
#=> ...<h1>Main Section 1</h1>
#=> ...<p>Intro</p>
#=> ...<section>
#=> ......<h2>Subhead 1.1</h2>
#=> ......<p>Meat</p>
#=> ......<p>MOAR MEAT</p>
#=> ...</section>
#=> ...<section>
#=> ......<h2>Subhead 1.2</h2>
#=> ......<p>Meat</p>
#=> ...</section>
#=> </section>

内心激荡 2024-08-22 13:10:41

By "pretty printing" of HTML page I presume you meant that you want to reformat the HTML structure with proper indentation. Nokogiri doesn't support this; the pretty_print method is for the "pp" library and the output is useful for debugging only.

Several projects understand HTML well enough to reformat it without destroying whitespace that is actually significant (the best known is HTML Tidy). By Googling, I also found a post titled "Pretty printing XHTML with Nokogiri and XSLT".

It comes down to this:

xsl = Nokogiri::XSLT(File.open("pretty_print.xsl"))
html = Nokogiri(File.open("source.html"))
puts xsl.apply_to(html).to_s

It requires you, of course, to download the linked XSL file to your filesystem. I've tried it very quickly on my machine and it works like a charm.

束缚m 2024-08-22 13:10:41

This worked for me:

 pretty_html = Nokogiri::HTML(html).to_xhtml(indent: 3) 

I tried the REXML version above, but it corrupted some of my documents, and I hate to bring XSLT into a new project. Both feel antiquated. :)

↙温凉少女 2024-08-22 13:10:41

You can try REXML:

require "rexml/document"

doc = REXML::Document.new(xml)
doc.write($stdout, 2)
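
If the formatted markup is needed as a string (for example, to write it to a cache file) rather than on `$stdout`, REXML's pretty formatter can be used directly. A minimal sketch, assuming the input is well-formed XML (REXML is a strict XML parser, so typical real-world HTML may be rejected); the sample `xml` string is made up for illustration:

```ruby
require "rexml/document"

# Made-up sample input; REXML needs well-formed XML.
xml = "<section><h1>Main Section 1</h1><p>Intro</p></section>"
doc = REXML::Document.new(xml)

formatter = REXML::Formatters::Pretty.new(2)  # indent by two spaces
formatter.compact = true                      # keep short text inline with its tags

output = +""                                  # unfrozen string to collect the result
formatter.write(doc, output)
puts output
```

Without `compact`, the pretty formatter puts text content on its own lines, which is often not what you want for short elements.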

旧人九事 2024-08-22 13:10:41

My solution was to add a print method to the Nokogiri node classes themselves. After you run the code in the snippet below, you should be able to write node.print, and it will pretty-print the contents. No XSLT required :-)

Nokogiri::XML::Node.class_eval do
  # Print every Node by default (will be overridden by CharacterData)
  define_method :should_print? do
    true
  end

  # Duplicate this node, replace the contents of the duplicated node with a
  # newline. With this content substitution, the #to_s method conveniently
  # returns a string with the opening tag (e.g. `<a href="foo">`) on the first
  # line and the closing tag on the second (e.g. `</a>`, provided that the
  # current node is not a self-closing tag).
  #
  # Now, print the open tag preceded by the correct amount of indentation, then
  # recursively print this node's children (with extra indentation), and then
  # print the close tag (if there is a closing tag)
  define_method :print do |indent=0|
    duplicate = self.dup
    duplicate.content = "\n"
    open_tag, close_tag = duplicate.to_s.split("\n")

    puts (" " * indent) + open_tag
    self.children.select(&:should_print?).each { |child| child.print(indent + 2) }
    puts (" " * indent) + close_tag if close_tag
  end
end

Nokogiri::XML::CharacterData.class_eval do
  # Only print CharacterData if there's non-whitespace content
  define_method :should_print? do
    content =~ /\S+/
  end

  # Replace every run of consecutive whitespace characters by a single space;
  # precede the output by a certain amount of indentation; print this text.
  define_method :print do |indent=0|
    puts (" " * indent) + to_s.strip.gsub(/\s+/, ' ')
  end
end

如若梦似彩虹 2024-08-22 13:10:41

Simpler and works well

puts Nokogiri::HTML(File.read('terms.fr.html')).to_xhtml

长亭外,古道边 2024-08-22 13:10:41

I know I am extremely late to answer this question, but still, I'll leave the answer. I tried all the above steps and it does work to an extent.

Nokogiri does reformat the HTML, but it does not pay attention to where the opening and closing tags end up, so the output is not really pretty-printed.

I found a gem called htmlbeautifier that works like a charm. I hope other people who are still searching for the answer will find this valuable.

不必了 2024-08-22 13:10:41

Why don't you try the pp method?

require 'pp'
pp some_var
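
Note that `pp` produces a debugging dump of the object, not re-indented HTML. It also explains the parameter the question asks about: `pp` calls an object's `pretty_print` hook and passes in its own printer object. A small sketch of that protocol using only the standard library; the `CachedPage` class and its fields are hypothetical and exist only for illustration:

```ruby
require "pp"
require "stringio"

# Hypothetical class, used only to show the protocol that `pp` relies on.
class CachedPage
  def initialize(url, links)
    @url = url
    @links = links
  end

  # `pp` calls this hook and passes its printer object (conventionally `q`).
  def pretty_print(q)
    q.group(2, "#<CachedPage ", ">") do
      q.text(@url)
      q.breakable
      q.pp(@links)
    end
  end
end

page = CachedPage.new("http://example.com/", ["/a", "/b"])
pp page                 # debugging dump on $stdout

buffer = StringIO.new
PP.pp(page, buffer)     # same dump, captured in a string
```

So `pretty_print` is normally not called by hand; you call `pp object` and the library supplies the argument.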