rvest: filter nodes based on subsequent nodes
I'm trying to scrape links from a webpage. The webpage has hyperlinked images and hyperlinked h3 headers. I want to discard the links for the images. Unfortunately, the divs have no classes, ids, or other attributes that identify the image hyperlinks. Is there some logic in rvest or bs4 to filter links based on the HTML elements nested inside them? For example, if the next element is an img, ignore the link; if the next element is an h3, keep it?
library(rvest)

html <- '<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
</head>
<body>
<div>
<div>
<div>
<a href="https://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/SNice.svg/1200px-SNice.svg.png">
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/SNice.svg/1200px-SNice.svg.png" style="max-width:72px;max-height:72px">
</a>
</div>
<span>
<h3>
<div>
Smiley Face
</div>
</h3>
</span>
<span>
<div>
https://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/SNice.svg/1200px-SNice.svg.png
</div>
</span>
</div>
</div>
<div>
<div>
<a href="https://www.hbs.edu">
<h3>
<div>Harvard Business School</div>
</h3>
<div>https://www.hbs.edu</div>
</a>
</div>
</div>
</body>
</html>'
my_page <- read_html(html)
my_page %>%
html_nodes("a") %>%
html_attr("href")
# [1] "https://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/SNice.svg/1200px-SNice.svg.png" # Want to ignore this
# [2] "https://www.hbs.edu" # Want to keep this
Comments (1)
With rvest you might use XPath (the parent axis) to specify the parent-child relationship between the anchor tag and the h3 tag, for example with an expression along the lines of //h3/parent::a[@href].
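A minimal sketch of that XPath approach, using a trimmed copy of the question's HTML. The names html, page, and links are placeholders chosen for this sketch, and the expression //h3/parent::a[@href] is one way to write the parent-axis selection described here, not necessarily the code originally posted:

```r
library(rvest)

# Trimmed version of the question's markup: one image link, one h3 link
html <- '<html><body>
<div>
<a href="https://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/SNice.svg/1200px-SNice.svg.png">
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/SNice.svg/1200px-SNice.svg.png">
</a>
</div>
<div>
<a href="https://www.hbs.edu">
<h3>
<div>Harvard Business School</div>
</h3>
<div>https://www.hbs.edu</div>
</a>
</div>
</body></html>'

page <- read_html(html)

# Start at every h3, step up to its parent, and keep that parent only if it
# is an <a> element carrying an href attribute
links <- page %>%
  html_nodes(xpath = "//h3/parent::a[@href]") %>%
  html_attr("href")

print(links)
```

On this input the image link has no h3 child, so it is expected to drop out and leave only the hbs.edu link.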
With bs4 you can use the :has pseudo-class selector with the > child combinator to specify that the anchor tag must have a direct child h3 element, for example a[href]:has(> h3). You can swap the child combinator for a descendant combinator, for example a[href]:has(h3), if the h3 may sit at any depth inside the anchor rather than being a direct child (a potential difference in DOM depth).
In either case, I have specified that the parent anchor tag must have an href attribute.
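The same direct-child versus any-depth distinction can also be written as XPath predicates in rvest. This is a sketch of an XPath analogue to the bs4 selectors, not the answer's own code; the example.com URLs and the variable names are hypothetical:

```r
library(rvest)

# Hypothetical markup: h3 as a direct child, h3 nested one level deeper,
# and an image-only link
html <- '<html><body>
<a href="https://example.com/direct"><h3>Direct child</h3></a>
<a href="https://example.com/nested"><div><h3>Nested deeper</h3></div></a>
<a href="https://example.com/image"><img src="https://example.com/i.png"></a>
</body></html>'

page <- read_html(html)

# Child predicate: anchors with an href whose immediate children include an h3
direct_child <- page %>%
  html_nodes(xpath = "//a[@href][h3]") %>%
  html_attr("href")

# Descendant predicate: anchors with an href containing an h3 at any depth
any_depth <- page %>%
  html_nodes(xpath = "//a[@href][.//h3]") %>%
  html_attr("href")

print(direct_child)
print(any_depth)
```

The child form is stricter and guards against matching anchors that merely wrap a larger block containing a heading; the descendant form is more forgiving when the page's nesting varies.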