How do I extract text from a web page using Hpricot?
I'm trying to parse an HTML file using Hpricot and Ruby, but I'm having issues extracting "free floating" text which is not enclosed in tags like <p></p>.
require 'hpricot'
text = <<SOME_TEXT
<a href="http://www.somelink.com/foo/bar.html">Testing:</a><br />
line 1<br />
line 2<br />
line 3<br />
line 4<br />
line 5<br />
<b>Here's some more text</b>
SOME_TEXT
parsed = Hpricot(text)
parsed = parsed.search('//a[@href="http://www.somelink.com/foo/bar.html"]').first.following_siblings
puts parsed
I would expect the result to be
<br />
line 1<br />
line 2<br />
line 3<br />
line 4<br />
line 5<br />
<b>Here's some more text</b>
But I am getting
<br />
<br />
<br />
<br />
<br />
<br />
<b>Here's some more text</b>
How can I make Hpricot return line 1, line 2, etc?
Comments (2)
It's been a while since I've used Hpricot, but here are some things I remember that might help:
The quick way to get all the text:
The downside to that is you get the text embedded in tags too.
Similarly, we can search for all 'text()' nodes:
So, we can do this:
or:
or:
The main idea for grabbing data/text from a web page is to look for landmarks you can use to navigate through the page. Often we can grab text from inside a <div> or <p> tag. If a page doesn't give you landmarks, you have to use other tricks: looking for a series of text nodes followed by <br> nodes, maybe, or the five lines following an <a> tag with a certain href attribute. That's the fun and challenge of dealing with HTML.
In the back of my mind there's a nagging thought that there is a more elegant way to do this, but this seems to be working. Dig around on the Hpricot Challenge page for variations on themes on digging out content.
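The "landmark" idea can be illustrated without a parser at all. What follows is only a minimal sketch, assuming the page looks exactly like the snippet in the question. It deliberately uses plain string splitting from Ruby's standard library instead of Hpricot, so it is only as robust as the markup is regular. The variable names are made up for illustration.

```ruby
# Sample markup copied from the question.
html = <<~HTML
  <a href="http://www.somelink.com/foo/bar.html">Testing:</a><br />
  line 1<br />
  line 2<br />
  line 3<br />
  line 4<br />
  line 5<br />
  <b>Here's some more text</b>
HTML

# Landmark 1: the link with the known href. Keep only what comes after it.
landmark   = %r{<a href="http://www\.somelink\.com/foo/bar\.html">.*?</a>}m
after_link = html.split(landmark, 2).last

# Landmark 2: every <br /> ends one chunk of free-floating text.
# Chunks that are empty or start with a tag are not free-floating text.
lines = after_link.split(%r{<br\s*/?>})
                  .map(&:strip)
                  .reject { |chunk| chunk.empty? || chunk.start_with?("<") }

puts lines
```

For real pages a parser-based walk over text nodes is usually safer; this version only shows how two landmarks (the link and the line breaks) are enough to fence off the wanted text.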
Your first step is to read the following_siblings documentation:
Then you should use the Hpricot source to generalize how following_siblings works to get something that works like following_siblings but doesn't filter out non-container nodes:
That's pretty much following_siblings with parent.children instead of parent.containers. Having access to the source code of the libraries you use is pretty handy and studying it is to be encouraged.
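The difference between walking parent.containers and parent.children can be modeled without Hpricot installed. The sketch below is a library-free toy, not Hpricot's source: the Node, Element, and Text classes and the following_nodes method are hypothetical stand-ins, and only the names following_siblings, children, and containers are borrowed from the answer above.

```ruby
# Hypothetical stand-ins for parsed nodes; these are NOT Hpricot classes.
class Node
  attr_reader :content
  attr_accessor :parent

  def initialize(content)
    @content = content
  end

  def container?
    false
  end

  # Element-only siblings after this node: text nodes are filtered out.
  def following_siblings
    after(parent.containers)
  end

  # Every sibling after this node, text nodes included.
  def following_nodes
    after(parent.children)
  end

  private

  # Everything in sibs that comes after self (found by object identity).
  def after(sibs)
    sibs.drop(sibs.index(self) + 1)
  end
end

class Text < Node; end

class Element < Node
  attr_reader :children

  def initialize(content, children = [])
    super(content)
    @children = children
    children.each { |child| child.parent = self }
  end

  def container?
    true
  end

  # Only the children that are elements themselves.
  def containers
    children.select(&:container?)
  end
end

link = Element.new("a")
Element.new("root", [
  link,
  Element.new("br"), Text.new("line 1"),
  Element.new("br"), Text.new("line 2"),
  Element.new("br"), Element.new("b")
])

elements_only = link.following_siblings.map(&:content)
with_text     = link.following_nodes.map(&:content)
puts elements_only.inspect
puts with_text.inspect
```

In this toy model the first list contains only the br and b entries, which matches the symptom in the question, while the second list keeps the "line N" text between them.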