How do I extract text from a web page using Hpricot?

Posted on 2024-10-06 12:51:39

I'm trying to parse an HTML file using Hpricot and Ruby, but I'm having issues extracting "free floating" text which is not enclosed in tags like <p></p>.

require 'hpricot'

text = <<SOME_TEXT
  <a href="http://www.somelink.com/foo/bar.html">Testing:</a><br />
  line 1<br />  
  line 2<br />
  line 3<br />
  line 4<br />
  line 5<br />
  <b>Here's some more text</b>
SOME_TEXT

parsed = Hpricot(text)

parsed = parsed.search('//a[@href="http://www.somelink.com/foo/bar.html"]').first.following_siblings
puts parsed

I would expect the result to be

<br />
line 1<br />  
line 2<br />
line 3<br />
line 4<br />
line 5<br />
<b>Here's some more text</b>

But I am getting

<br />
<br />
<br />
<br />
<br />
<br />
<b>Here's some more text</b>

How can I make Hpricot return line 1, line 2, etc?

抚你发端 2024-10-13 12:51:40

It's been a while since I've used Hpricot, but here are some things I remember that might help:

The quick way to get all the text:

irb(main):023:0> print parsed.inner_text
  Testing:
  line 1  
  line 2
  line 3
  line 4
  line 5
  Here's some more text

The downside to that is you get the text embedded in tags too.

Similarly, we can search for all 'text()' nodes:

irb(main):033:0> puts (parsed / 'text()')

Testing:

  line 1

  [...]

  line 5

So, we can do this:

irb(main):036:0> puts (parsed / 'text()')[2 .. -3]

  line 1

  line 2

  line 3

  line 4

  line 5

or:

irb(main):037:0> (parsed / 'text()')[2 .. -3]
=> #<Hpricot::Elements["\n  line 1", "  \n  line 2", "\n  line 3", "\n  line 4", "\n  line 5", "\n  "]>

or:

irb(main):039:0> (parsed / 'text()')[2 .. -3].map{ |t| t.inner_text.strip }
=> ["line 1", "line 2", "line 3", "line 4", "line 5", ""]
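The last result still carries an empty string left over from the whitespace-only node. Here is a minimal sketch of dropping it. It uses plain Ruby strings copied from the output above in place of Hpricot text nodes, so `raw_nodes` is only a stand-in and not an Hpricot object:

```ruby
# Stand-in for the text nodes shown above (an assumption for illustration).
# With real Hpricot nodes you would call inner_text on each one first.
raw_nodes = ["\n  line 1", "  \n  line 2", "\n  line 3", "\n  line 4", "\n  line 5", "\n  "]

# Strip surrounding whitespace, then discard entries that were whitespace only.
lines = raw_nodes.map(&:strip).reject(&:empty?)

puts lines
```

Filtering on emptiness rather than trimming the range by hand means the slice indices no longer have to account for stray whitespace nodes.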

The main idea for grabbing data/text from a web page is to look for landmarks you can use to navigate through the page. Often we can grab text from inside a <div> or <p> tag. If a page doesn't give you landmarks, you have to use other tricks: looking for a series of text nodes followed by <br> nodes, maybe, or the five lines following an <a> tag with a certain href attribute. That's the fun and challenge of dealing with HTML.
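To make the landmark idea concrete without depending on Hpricot, here is a rough sketch that uses only Ruby's standard String methods on the snippet from the question. It treats the first closing </a> and the following <b> as landmarks and splits on <br> in between. The variable names are made up for illustration, and regex-based splitting like this is brittle on real-world HTML, so read it as a demonstration of the idea rather than a recommended parser:

```ruby
html = <<~HTML
  <a href="http://www.somelink.com/foo/bar.html">Testing:</a><br />
  line 1<br />
  line 2<br />
  line 3<br />
  line 4<br />
  line 5<br />
  <b>Here's some more text</b>
HTML

# Landmark 1: keep everything after the first closing </a>.
after_link = html.split(%r{</a>}, 2).last

# Landmark 2: stop at the next <b> element (/<b>/ does not match <br />).
before_bold = after_link.split(/<b>/, 2).first

# The free-floating text sits between the <br /> tags.
free_text = before_bold.split(%r{<br\s*/?>}).map(&:strip).reject(&:empty?)

puts free_text
```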

In the back of my mind there's a nagging thought that there is a more elegant way to do this, but this seems to be working. Dig around on the Hpricot Challenge page for variations on themes on digging out content.

半﹌身腐败 2024-10-13 12:51:39

Your first step is to read the following_siblings documentation:

Find sibling elements which follow the current one. Like the other “sibling” methods, this weeds out text and comment nodes.

Then you should use the Hpricot source to generalize how following_siblings works to get something that works like following_siblings but doesn't filter out non-container nodes:

parsed        = Hpricot(text)
link          = parsed.search('//a[@href="http://www.somelink.com/foo/bar.html"]').first
link_sibs     = link.parent.children
what_you_want = link_sibs[link_sibs.index(link) + 1 ... link_sibs.length]

puts what_you_want

That's pretty much following_siblings with parent.children instead of parent.containers. Having access to the source code of the libraries you use is pretty handy and studying it is to be encouraged.
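The difference between the two views can be illustrated with a toy model in plain Ruby. The `Node` struct and the `children` array below are hypothetical stand-ins, not Hpricot classes. They only mimic a parent whose child list mixes elements and text nodes:

```ruby
# Toy node type: kind is :elem or :text. Hypothetical, not part of Hpricot.
Node = Struct.new(:kind, :content)

children = [
  Node.new(:elem, '<a href="http://www.somelink.com/foo/bar.html">Testing:</a>'),
  Node.new(:elem, "<br />"),
  Node.new(:text, "line 1"),
  Node.new(:elem, "<br />"),
  Node.new(:text, "line 2"),
  Node.new(:elem, "<b>Here's some more text</b>")
]
link = children.first

# Element-only view, analogous to what the answer calls parent.containers.
containers = children.select { |node| node.kind == :elem }

# Siblings after the link, first without and then with text nodes.
elements_only = containers.drop(containers.index(link) + 1)
all_following = children.drop(children.index(link) + 1)

puts all_following.map(&:content)
```

In this model `elements_only` loses "line 1" and "line 2", which mirrors the output in the question, while `all_following` keeps them.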
