Unclosed tags and Nokogiri
My test html file is here: http://pastebin.com/L88nYbQY
As you can see, there are some unclosed input tags and some self-closing ones.
This causes the following code to return everything from the opening #qcbody div to the end of the file, ignoring the closing div tag.
require 'nokogiri'
f = File.open('t.html', 'r')
@doc = Nokogiri::XML(f)
@doc.at_css('#qcbody').to_html
I'm sure people have gotten around this problem in a variety of ways. How would you do it?
Give this a try in IRB: parse the file with Nokogiri::HTML instead of Nokogiri::XML, then run the same at_css call.
The difference between using Nokogiri::XML and Nokogiri::HTML is the leniency when parsing the document. XML is required to be well-formed, and some XML parsers will reject a file that doesn't meet the standard. In XML, an input tag with no closing tag is simply an element that is still open, so the closing div tag that follows doesn't end the div you expected. Nokogiri lets us set how picky it is. (In the case of XML, you can look at the document's errors array after parsing to see if there was a problem.)
For HTML, Nokogiri relaxes the parser so there's a better chance of handling real-world HTML. I've seen it handle some really ugly markup and keep on going when lesser parsers blew their lunch. If you look at Nokogiri::HTML.parse, it has options = XML::ParseOptions::DEFAULT_HTML defined, which are the relaxed settings. You can override that if you want to make sure the HTML conforms.