Best way to parse a file containing links exported from Delicious.com with Nokogiri?

Posted on 2024-10-08 09:28:40

I want to parse an html file containing links exported from Delicious. I am using Nokogiri for the parsing. The file has the following structure:

<DT>
   <A HREF="http://mezzoblue.com/archives/2009/01/27/sprite_optim/"
      ADD_DATE="1233132422"
      PRIVATE="0"
      TAGS="irw_20">mezzoblue § Sprite Optimization</A>
<DT>
   <A HREF="http://datamining.typepad.com/data_mining/2008/11/minority-report-interface.html" 
      ADD_DATE="1226827542" 
      PRIVATE="0" 
      TAGS="irw_20">Minority Report Interface</A>
<DT>
   <A HREF="http://www.windowshop.com/" 
      ADD_DATE="1225267658" 
      PRIVATE="0" 
      TAGS="irw_20">Amazon Windowshop Beta</A>
<DD>Window shopping from Amazon

As you can see the link information is in the DT-tag and some links have a comment in a DD-tag.

I do the following to get the link information:

doc.xpath('//dt//a').each do |node|
  title = node.text
  url = node['href']
  tags = node['tags']
  puts "#{title}, #{url}, #{tags}"
end

My question is how do I get the link information AND the comment when a dd tag is present?

Comments (3)

孤云独去闲 2024-10-15 09:28:40

My question is how do I get the link
information AND the comment when a dd
tag is present?

Use:

//DT/a | //DT[a]/following-sibling::*[1][self::DD]

This selects all a elements that have a DT parent and all DD elements that are the immediate following sibling element of a DT element that has an a child.

Note: Using the // abbreviation is strongly discouraged, because it usually makes the expression inefficient and can produce results the developer did not expect.

Whenever the structure of the XML document is known, avoid // and spell out the path instead.

小…楫夜泊 2024-10-15 09:28:40

Your question isn't clear about what you are looking for.

First, the HTML is malformed because the <DT> tags are not closed correctly, and there is an illegal character in the first a tag's text that Ruby 1.9.2 doesn't like because it's not UTF-8. I converted the character to an entity in TextMate.

html = %{
<DT>
  <A HREF="http://mezzoblue.com/archives/2009/01/27/sprite_optim/" ADD_DATE="1233132422" PRIVATE="0" TAGS="irw_20">mezzoblue § Sprite Optimization</A>
<DT>
  <A HREF="http://datamining.typepad.com/data_mining/2008/11/minority-report-interface.html" ADD_DATE="1226827542" PRIVATE="0" TAGS="irw_20">Minority Report Interface</A>
<DT>
  <A HREF="http://www.windowshop.com/" ADD_DATE="1225267658" PRIVATE="0" TAGS="irw_20">Amazon Windowshop Beta</A>
<DD>Window shopping from Amazon
}

That HTML parses to this in Nokogiri after it tries to fix it up:

(rdb:1) print doc.to_html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<dt>
  <a href="http://mezzoblue.com/archives/2009/01/27/sprite_optim/" add_date="1233132422" private="0" tags="irw_20">mezzoblue § Sprite Optimization</a>
<dt>
  <a href="http://datamining.typepad.com/data_mining/2008/11/minority-report-interface.html" add_date="1226827542" private="0" tags="irw_20">Minority Report Interface</a>
<dt>
  <a href="http://www.windowshop.com/" add_date="1225267658" private="0" tags="irw_20">Amazon Windowshop Beta</a>
</dt>
</dt>
</dt>
<dd>Window shopping from Amazon
</dd>
</body></html>

Notice how the closing dt tags are grouped just before the only dd tag? That's icky, but ok because it doesn't change how we have to look for the dd content.

require 'nokogiri'

doc = Nokogiri::HTML(html, nil, 'UTF-8')

# Collect the text of every <dd> that immediately follows a <dt>.
comments = []
doc.css('dt + dd').each do |dd|
  comments << dd.text
end
puts comments

# >> Window shopping from Amazon

That selector means: find each <dd> that immediately follows a <dt>. You can't look for dt followed by a followed by dd, because that's not how the HTML parses: the a sits inside the dt, so in the parsed tree it is the dt that is followed by the dd, which is what "dt + dd" means.

The other way it seemed like your question could read was that you were looking for the content of the a tags:

# Collect the text of every link.
titles = doc.css('a').map(&:text)
puts titles

# >> mezzoblue § Sprite Optimization
# >> Minority Report Interface
# >> Amazon Windowshop Beta
梦回梦里 2024-10-15 09:28:40

I'm assuming that the line:

<DD>Window shopping from Amazon

has a closing </DD> tag; I can't tell from your snippet of the page alone. If so, you could do:

comment = node.parent.next_sibling.next_sibling.text rescue nil

You need to call next_sibling twice because the first call returns the text node holding the newline or whitespace that follows the DT element. You could strip all newlines before parsing the page to avoid the double call. That might also be a good idea in case there is more than one newline character after the DT tag.
