Best way to parse a file containing links exported from Delicious.com with Nokogiri?
I want to parse an HTML file containing links exported from Delicious. I am using Nokogiri for the parsing. The file has the following structure:
<DT>
<A HREF="http://mezzoblue.com/archives/2009/01/27/sprite_optim/"
ADD_DATE="1233132422"
PRIVATE="0"
TAGS="irw_20">mezzoblue § Sprite Optimization</A>
<DT>
<A HREF="http://datamining.typepad.com/data_mining/2008/11/minority-report-interface.html"
ADD_DATE="1226827542"
PRIVATE="0"
TAGS="irw_20">Minority Report Interface</A>
<DT>
<A HREF="http://www.windowshop.com/"
ADD_DATE="1225267658"
PRIVATE="0"
TAGS="irw_20">Amazon Windowshop Beta</A>
<DD>Window shopping from Amazon
As you can see, the link information is in the DT tag, and some links have a comment in a DD tag.
I do the following to get the link information:
doc.xpath('//dt//a').each do |node|
  title = node.text
  url = node['href']
  tags = node['tags']
  puts "#{title}, #{url}, #{tags}"
end
My question is how do I get the link information AND the comment when a dd tag is present?
Comments (3)
Use an XPath expression that selects all a elements that have a DT parent, together with all DD elements that are the immediate following sibling of a DT element that has an a child.

Note: using the // abbreviation is strongly discouraged, because it usually leads to inefficient evaluation and to results that surprise developers. Whenever the structure of the XML document is known, avoid the // abbreviation.

Your question isn't clear about what you are looking for.

First, the HTML is malformed: the <DT> tags are not closed correctly, and the first <a> tag's text contains an illegal character that Ruby 1.9.2 doesn't like because it's not UTF-8. I converted the character to an entity in TextMate.

When Nokogiri parses that HTML and tries to fix it up, the closing dt tags end up grouped just before the only dd tag. That's icky, but OK, because it doesn't change how we have to look for the dd content.

The selector to use is "dt + dd", which means: find <dt> followed by <dd>. You can't look for dt followed by a followed by dd, because that's not how the HTML parses. It is really dt followed by dd.

The other way your question could be read is that you are looking for the content of the <a> tags, in which case you select the <a> elements inside the <dt> elements and read their text.

I'm assuming the <DD> element has a closing </DD> tag; I can't tell from just your snippet of the page. If so, you could get to it with next_sibling.

You need to call next_sibling twice, because the first call will match a \n (newline) or other whitespace. You could remove all the newlines before parsing the page to avoid the double call. That might also be a good idea in case there is more than one newline character after the DT tag.