How do I pretty-print HTML with Nokogiri?
I wrote a web crawler in Ruby and I'm using `Nokogiri::HTML` to parse each page. I need to print the page out, and while experimenting in IRB I noticed a `pretty_print` method. However, it takes a parameter and I can't figure out what it expects.
My crawler caches the HTML of each web page and writes it to files on my local machine. I would like to "pretty print" the HTML so that it is nicely indented and properly formatted when it is written out.
Comments (8)
The answer by @mislav is somewhat wrong. Nokogiri does support pretty-printing if you use `to_xhtml` or `to_xml` to specify pretty-printing parameters.
In action:
By "pretty printing" an HTML page I presume you mean reformatting the HTML structure with proper indentation. Nokogiri doesn't support this; the `pretty_print` method is for the "pp" library, and its output is useful for debugging only.
There are several projects that understand HTML well enough to reformat it without destroying whitespace that is actually significant (the famous one is HTML Tidy), but by Googling I found a post titled "Pretty printing XHTML with Nokogiri and XSLT".
It comes down to this:
It requires you, of course, to download the linked XSL file to your filesystem. I tried it very quickly on my machine and it works like a charm.
This worked for me:
I tried the REXML version suggested in another answer, but it corrupted some of my documents. And I hate to bring XSLT into a new project. Both feel antiquated. :)
You can try REXML:
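The original snippet is missing from this copy. A minimal sketch of the REXML route, assuming the markup is well-formed XML (REXML is an XML parser and raises a parse error on typical tag-soup HTML): the second argument to `Document#write` is the indent width, and any object that supports `<<` can be the output target.

```ruby
require "rexml/document"

# Sample input; it must be well-formed XML for REXML to accept it.
xml = "<html><body><h1>Title</h1><p>Text</p></body></html>"
doc = REXML::Document.new(xml)

# Write into an unfrozen String with an indent of 2 spaces per level.
pretty = +""
doc.write(pretty, 2)
puts pretty
```

Note that REXML's pretty formatter also moves text content onto its own lines, which changes whitespace inside elements; that may be acceptable for inspection but not for pages where whitespace matters.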
My solution was to add a `print` method onto the actual `Nokogiri` objects. After you run the code in the snippet below, you should be able to write `node.print`, and it'll pretty-print the contents. No XSLT required :-)
Simpler and works well
I know I am extremely late to answer this question, but I'll leave an answer anyway. I tried all the steps above and they do work to an extent. `Nokogiri` does format the HTML, but it does not care about the opening or closing tags, so a truly pretty format is out of the picture.
I found a gem called htmlbeautifier that works like a charm. I hope other people who are still searching for an answer will find this valuable.
Why don't you try the `pp` method?