rvest: filtering nodes based on subsequent nodes

Posted on 2025-02-05 21:00:10

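I'm trying to scrape links from a webpage. The page contains hyperlinked images and hyperlinked h3 headers, and I want to discard the image links. Unfortunately, the divs have no classes, ids, or other attributes that identify the image hyperlinks. Is there some logic in rvest or bs4 for filtering links based on the HTML elements nested inside them? For example, if the element nested in the link is an img, ignore the link; if it is an h3, keep it.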

html <- '<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>
<div>
  <div>
    <div>
        <a href="https://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/SNice.svg/1200px-SNice.svg.png">
            <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/SNice.svg/1200px-SNice.svg.png" style="max-width:72px;max-height:72px">
        </a>
    </div>
      <span>
          <h3>
             <div>
                 Smiley Face
             </div>
          </h3>
      </span>
      <span>
          <div>
              https://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/SNice.svg/1200px-SNice.svg.png
          </div>
      </span>
  </div>
</div>
<div>
    <div>
        <a href="https://www.hbs.edu">
            <h3>
                <div>Harvard Business School</div>
            </h3>
            <div>https://www.hbs.edu</div>
        </a>
    </div>
</div>
</body>
</html>'

library(rvest) # provides read_html(), html_nodes(), html_attr() and %>%

my_page <- read_html(html)
my_page %>%
  html_nodes("a") %>%
  html_attr("href")

# [1] "https://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/SNice.svg/1200px-SNice.svg.png" # Want to ignore this
# [2] "https://www.hbs.edu" # Want to keep this



多谢你的绝情让我学会死心 2025-02-12 21:00:10

With rvest, you can use XPath (the parent axis) to specify the parent-child relationship between the anchor tag and the h3 tag, as follows:

library(rvest)
library(magrittr)

html <- '<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>
<div>
  <div>
    <div>
        <a href="https://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/SNice.svg/1200px-SNice.svg.png">
            <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/SNice.svg/1200px-SNice.svg.png" style="max-width:72px;max-height:72px">
        </a>
    </div>
      <span>
          <h3>
             <div>
                 Smiley Face
             </div>
          </h3>
      </span>
      <span>
          <div>
              https://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/SNice.svg/1200px-SNice.svg.png
          </div>
      </span>
  </div>
</div>
<div>
    <div>
        <a href="https://www.hbs.edu">
            <h3>
                <div>Harvard Business School</div>
            </h3>
            <div>https://www.hbs.edu</div>
        </a>
    </div>
</div>
</body>
</html>'

my_page <- read_html(html)
my_page %>%
  html_elements(xpath = "//h3/parent::a[@href]") %>%
  html_attr("href")

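With bs4, you can use the :has pseudo-class selector with the > child combinator to require that the anchor tag has a direct child h3 element. If the h3 may be any descendant rather than a direct child (that is, the DOM depth can vary), swap the child combinator for a descendant combinator: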

from bs4 import BeautifulSoup as bs

html =  '''<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>
<div>
  <div>
    <div>
        <a href="https://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/SNice.svg/1200px-SNice.svg.png">
            <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/SNice.svg/1200px-SNice.svg.png" style="max-width:72px;max-height:72px">
        </a>
    </div>
      <span>
          <h3>
             <div>
                 Smiley Face
             </div>
          </h3>
      </span>
      <span>
          <div>
              https://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/SNice.svg/1200px-SNice.svg.png
          </div>
      </span>
  </div>
</div>
<div>
    <div>
        <a href="https://www.hbs.edu">
            <h3>
                <div>Harvard Business School</div>
            </h3>
            <div>https://www.hbs.edu</div>
        </a>
    </div>
</div>
</body>
</html>'''

soup = bs(html, 'lxml') # pip install lxml if missing
print([i['href'] for i in soup.select('a[href]:has(> h3)')])

In either case, I have specified that the parent anchor tag must have an href attribute.
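
To make the difference between the child and descendant combinators concrete, here is a small sketch. It assumes a hypothetical variation of the page in which the h3 is wrapped in an extra span (the `SAMPLE` markup, the example.org URL and the `hrefs` helper are made up for illustration), and it uses the built-in `html.parser` backend so that lxml is not needed. The last selector shows the inverse filter, dropping anchors that contain an img, which may or may not fit the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical variation of the page: the h3 sits inside a span, one level deeper.
SAMPLE = """
<div><a href="https://example.org/smiley.png"><img src="https://example.org/smiley.png"></a></div>
<div><a href="https://www.hbs.edu"><span><h3>Harvard Business School</h3></span></a></div>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")


def hrefs(selector):
    """Return the href of every element matched by a CSS selector."""
    return [a["href"] for a in soup.select(selector)]


direct_child = hrefs("a[href]:has(> h3)")   # h3 must be a direct child of the anchor
any_descendant = hrefs("a[href]:has(h3)")   # h3 may sit at any depth inside the anchor
no_image = hrefs("a[href]:not(:has(img))")  # inverse filter: drop anchors containing an img

print(direct_child, any_descendant, no_image)
```

With the extra span in place, the direct-child selector matches nothing, while the descendant form and the img exclusion both return only the hbs.edu link.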

