Yahoo Pipes: How to parse child DIVs

Posted 2024-11-06 08:34:24

On a page with multiple DIVs, how can I fetch content only from the DIVs that contain useful text and skip the DIVs used for ads and the like?

For example, a page structure like this:

...

<div id="articlecopy">

  <div class="advertising 1">Ads I do not want to fetch.</div>

  <p>Useful texts go here</p>

  <div class="advertising 2">Ads I do not want to fetch.</div>

  <div class="related_articles_list">I do not want to read related articles so parse this part too</div>

</div>

...

In this fictional example, I want to get rid of the two advertising DIVs and the related-articles DIV. All I want is to fetch the useful content in the <p> inside the parent DIV.

Can Pipes do this?

Thank you.

Comments (1)

零度℉ 2024-11-13 08:34:24

Try the YQL module with xpath. Something along these lines:

SELECT * from html where url="http://MyWebPageWithAds.com" and xpath='//div/p'

The above query will retrieve the part of the html inside the <p> tag under the parent <div> tag. You can get fancy with xpath if your DIVs have attributes.
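For the markup in the question, the XPath could probably be narrowed to the parent's id, for example `//div[@id="articlecopy"]/p`, so that only paragraphs directly under that DIV are returned. If YQL is not available to you, the same selection can be tried locally. The following is a minimal sketch in Python using only the standard library; it assumes the page is well-formed enough to parse as XML (real pages often are not), and `PAGE` and `useful_paragraphs` are hypothetical names made up for illustration.

```python
import xml.etree.ElementTree as ET

# Sample markup modelled on the question; wrapped in <html><body> so it has one root.
PAGE = """
<html><body>
<div id="articlecopy">
  <div class="advertising 1">Ads I do not want to fetch.</div>
  <p>Useful texts go here</p>
  <div class="advertising 2">Ads I do not want to fetch.</div>
  <div class="related_articles_list">Related articles</div>
</div>
</body></html>
"""


def useful_paragraphs(markup):
    """Return the text of every <p> that is a direct child of div#articlecopy."""
    root = ET.fromstring(markup.strip())
    # ElementTree supports a limited XPath subset; this attribute predicate is part of it.
    paragraphs = root.findall(".//div[@id='articlecopy']/p")
    return ["".join(p.itertext()).strip() for p in paragraphs]


if __name__ == "__main__":
    print(useful_paragraphs(PAGE))
```

Because the path selects only `<p>` children, the advertising and related-articles DIVs are never visited, which is the effect the question asks for.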

Say for example you had a page with several DIVs, but the one you wanted looked like this:

<div>
    <div>Stuff I don't want</div>
    <div class="main_content">Stuff I want to add to my feed</div>
    <div>Other stuff I don't want</div> 
</div>

You would change the YQL string above to this:

SELECT * from html where url="http://MyWebPageWithAds.com" 
and xpath='//div/div[contains(@class,"main_content")]'
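Note that `contains(@class, "main_content")` is a substring test, so it would also match a class such as `main_content_wide`. The sketch below shows a stricter whole-token match done locally in Python. It is an illustration under assumptions, not part of YQL: the standard-library `xml.etree.ElementTree` does not implement `contains()`, so the filtering is done in plain Python, and `PAGE` and `divs_with_class` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Sample markup copied from the answer's example.
PAGE = """
<div>
    <div>Stuff I don't want</div>
    <div class="main_content">Stuff I want to add to my feed</div>
    <div>Other stuff I don't want</div>
</div>
"""


def divs_with_class(markup, class_name):
    """Return the text of every <div> whose class attribute lists class_name."""
    root = ET.fromstring(markup.strip())
    matches = []
    for div in root.iter("div"):
        # Split on whitespace so "advertising 1" yields the tokens "advertising" and "1".
        if class_name in div.get("class", "").split():
            matches.append("".join(div.itertext()).strip())
    return matches


if __name__ == "__main__":
    print(divs_with_class(PAGE, "main_content"))
```

The same function could be inverted to drop unwanted blocks, for instance by skipping any DIV whose class tokens include `advertising`.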

I've only recently discovered YQL myself, and am fairly new to using xpaths, but it has worked for me so far.
