Yahoo Pipes: How to parse sub-DIVs
For a page with multiple DIVs, how can I fetch content only from the DIVs that contain useful text and skip the others, such as those used for ads?
For example, a page structure like this:
...
<div id="articlecopy">
<div class="advertising 1">Ads I do not want to fetch.</div>
<p>Useful texts go here</p>
<div class="advertising 2">Ads I do not want to fetch.</div>
<div class="related_articles_list">I do not want to read related articles so parse this part too</div>
</div>
...
In this fictional example, I want to get rid of the two advertising DIVs and the related-articles DIV. All I want is to fetch the useful content in the <p> inside the parent DIV.
Can Pipes do this?
Thank you.
Comments (1)
Try the YQL module with XPath. Something along these lines:
The above query will retrieve the part of the HTML inside the <p> tag under the parent <div> tag. You can get fancy with XPath if your DIVs have attributes.
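To make the idea of selecting the <p> under the parent <div> concrete, here is a minimal sketch in Python rather than YQL. It assumes the markup is well-formed enough to parse as XML, which real pages often are not, and it uses the standard library's ElementTree, which supports only a subset of XPath. The function name useful_paragraphs and the wrapping <html><body> are hypothetical and only there for illustration.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question, wrapped in <html><body> for illustration.
# Assumption: the page parses as XML; real-world HTML may need an HTML parser.
PAGE = """<html><body>
<div id="articlecopy">
<div class="advertising 1">Ads I do not want to fetch.</div>
<p>Useful texts go here</p>
<div class="advertising 2">Ads I do not want to fetch.</div>
<div class="related_articles_list">I do not want to read related articles so parse this part too</div>
</div>
</body></html>"""


def useful_paragraphs(markup, container_id="articlecopy"):
    """Return the text of each <p> that is a direct child of the given <div>."""
    root = ET.fromstring(markup)
    # XPath: any <div> with the matching id, then its <p> children only.
    path = f".//div[@id='{container_id}']/p"
    return [(p.text or "").strip() for p in root.findall(path)]


if __name__ == "__main__":
    print(useful_paragraphs(PAGE))  # ['Useful texts go here']
```

Because the path stops at the <p> children of the container, the advertising and related-articles DIVs are never selected in the first place.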
Say for example you had a page with several DIVs, but the one you wanted looked like this:
You would change the YQL string above to this:
I've only recently discovered YQL myself and am fairly new to XPath, but it has worked for me so far.
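For the attribute case mentioned above, the following sketch shows two options, again in Python with ElementTree as a stand-in for YQL: an XPath attribute predicate that picks a DIV by its class, and a filter that keeps every child of the container except the unwanted DIVs. The exclusion is done in Python because ElementTree's XPath subset has no not() or starts-with(). The names find_by_class, strip_unwanted and UNWANTED_PREFIXES are hypothetical, and the class names come from the fictional example in the question.

```python
import xml.etree.ElementTree as ET

# Same fictional markup as in the question; assumed to be well-formed XML.
PAGE = """<html><body>
<div id="articlecopy">
<div class="advertising 1">Ads I do not want to fetch.</div>
<p>Useful texts go here</p>
<div class="advertising 2">Ads I do not want to fetch.</div>
<div class="related_articles_list">I do not want to read related articles so parse this part too</div>
</div>
</body></html>"""

# Class-name prefixes of the DIVs to skip (taken from the example page).
UNWANTED_PREFIXES = ("advertising", "related_articles_list")


def find_by_class(markup, css_class):
    """Select every <div> whose class attribute equals css_class exactly."""
    root = ET.fromstring(markup)
    return root.findall(f".//div[@class='{css_class}']")


def strip_unwanted(markup, container_id="articlecopy"):
    """Return the text of the container's children, minus the unwanted DIVs."""
    root = ET.fromstring(markup)
    container = root.find(f".//div[@id='{container_id}']")
    if container is None:
        return []
    kept = []
    for child in container:
        css_class = child.get("class", "")
        if child.tag == "div" and css_class.startswith(UNWANTED_PREFIXES):
            continue  # skip ads and related-article lists
        kept.append("".join(child.itertext()).strip())
    return kept


if __name__ == "__main__":
    print(len(find_by_class(PAGE, "advertising 1")))  # 1
    print(strip_unwanted(PAGE))  # ['Useful texts go here']
```

The filter approach is useful when the wanted content is not confined to <p> tags, since it keeps whatever remains once the known-bad DIVs are dropped.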