R: Extracting HTML from a list

Posted on 2025-02-11 04:59:45 · 2438 characters · 1 view · 0 comments

I am working with the R programming language. I have a list that contains HTTP links (amongst other things) and looks something like this:

    library(rvest)
    library(httr)
    library(XML)

    url <- "mywebsite.com"
    page <- read_html(url)
    links1 <- page %>% html_nodes("li")

    head(links1)

{xml_nodeset (393)}
 [3] <li class="social-icon"><a class="tip-me" href="https://www.youtube.com/channel/UCYNT3iuUwsnEwelGScQ3k1A/videos" data-toggle="tooltip" data-animation="true" title= ...
 [4] <li class="social-icon"><a class="tip-me" href="https://www.web222.ca" data- ...
 [5] <li class="social-icon"><a class="tip-me" href="#" data-toggle="tooltip" data-animation="true" title=""><span class="icon-dribbble"></span></a></li>
 [6] <li><a href="https://www.web777.ca/">Home</a></li>\n
 [7] <li id="menu-item-17" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-home menu-item-17"><a href="https://www.web555.ca/" ...
 [8] <li id="menu-item-2606" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-2606"><a href="https://www.web111.ca">L ...
 [9] <li id="menu-item-18618" class="menu-item menu-item-type-custom menu-item-object-custom menu-item-has-children menu-item-18618">\n<a href="#">Local Listings</a>\n< ...
[10] <li id="menu-item-10758" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-10758"><a href="https://www.web123.ca/den ...
[11] <li id="menu-item-1227" class="menu-item menu-item-type-taxonomy menu-item-object-listings_categories menu-item-1227"><a href="https://www.web123.c ...
[12] <li id="menu-item-1226" class="menu-item menu-item-type-taxonomy menu-item-object-listings_categories menu-item-1226"><a href="https://www.web124.c ...
[13] <li id="menu-item-883" class=

I want to extract every URL contained in this list. I think these are stored in the "href" attribute of the <a> element inside each list item. I tried several ways to do this, but in the end the only thing I got working was a slightly different approach:

    # source: https://www.geeksforgeeks.org/extract-all-the-urls-from-the-webpage-using-r-language/

    # make the HTTP request
    resource <- GET(url)

    # parse the response as HTML
    parse <- htmlParse(resource)

    # collect the href attribute of every <a> tag
    links2 <- xpathSApply(parse, path="//a", xmlGetAttr, "href")

    # print the links
    print(links2)

My question: I would have thought there is some way to extract the links from "links1" directly, instead of approaching the problem with a different method as I did for "links2". Can someone please show me how to extract the URLs from "links1"?

Thanks!


下雨或天晴 2025-02-18 04:59:45

Try this. html_attr() already returns the URLs as a character vector, so no further text extraction step is needed:

    links1 <- page %>%
      html_nodes("li a") %>%
      html_attr("href")
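
The question asks specifically about starting from the existing "links1" node set rather than from "page". A minimal sketch of that, assuming rvest 1.0 or later: because the real page is not available, the HTML below is a hypothetical stand-in built with minimal_html(), and the names urls and urls_clean are illustrative only.

```r
library(rvest)

# Hypothetical stand-in for the real page (assumption: not the asker's site)
page <- minimal_html('
  <ul>
    <li class="social-icon"><a class="tip-me" href="https://www.web222.ca">Site</a></li>
    <li class="social-icon"><a class="tip-me" href="#"></a></li>
    <li><a href="https://www.web777.ca/">Home</a></li>
    <li>No link here</li>
  </ul>')

# Same selection as in the question: every <li> node
links1 <- page %>% html_nodes("li")

# Look for <a> descendants inside the already-selected <li> nodes,
# then read the href attribute of each one
urls <- links1 %>% html_nodes("a") %>% html_attr("href")

# Optional cleanup: drop missing values, "#" placeholders, and repeats
urls_clean <- unique(urls[!is.na(urls) & urls != "#"])
print(urls_clean)
```

List items without an <a> child simply contribute nothing to the result, so no separate filtering of "links1" should be needed before the attribute lookup.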
