How to read and parse webpage content in R

Posted on 2024-08-14 02:47:43

I'd like to read the contents of a URL (e.g., http://www.haaretz.com/) in R. How can I do this?


Comments (3)

枫林﹌晚霞¤ 2024-08-21 02:47:43

I'm not really sure how you want to process that page, because it's really messy. As we re-learned in this famous Stack Overflow question, it's not a good idea to run regular expressions on HTML, so you will definitely want to parse it with the XML package.

Here's an example to get you started:

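library(RCurl)
library(XML)
# download the page as a single string
webpage <- getURL("http://www.haaretz.com/")
# split the string into lines
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
# parse the HTML, silencing parser errors caused by malformed markup
pagetree <- htmlTreeParse(webpage, error = function(...) {}, useInternalNodes = TRUE)
# extract the text content of every table
x <- xpathSApply(pagetree, "//table", xmlValue)
# clean up with regular expressions: split on newlines, drop tabs, trim whitespace
x <- unlist(strsplit(x, "\n"))
x <- gsub("\t", "", x)
x <- sub("^[[:space:]]*(.*?)[[:space:]]*$", "\\1", x, perl = TRUE)
# drop empty strings and stray "|" separators
x <- x[!(x %in% c("", "|"))]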

This results in a character vector consisting mostly of webpage text (along with some JavaScript):

> head(x)
[1] "Subscribe to Print Edition"              "Fri., December 04, 2009 Kislev 17, 5770" "Israel Time: 16:48 (EST+7)"           
[4] "  Make Haaretz your homepage"          "/*check the search form*/"               "function chkSearch()" 
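
The JavaScript fragments in that output come from script elements nested inside the tables. One possible way to leave them out is to select text nodes directly and exclude those with a script or style ancestor. The following is only a sketch: the inline HTML string is a made-up stand-in for the downloaded page, and the XPath predicate is an assumption about what you would want to filter, not part of the original answer.

```r
library(XML)

# Hypothetical inline page standing in for the downloaded document
html <- '<html><head><script>function chkSearch() {}</script></head>
<body><table><tr><td>Subscribe to Print Edition</td></tr>
<tr><td><script>var a = 1;</script>Israel Time</td></tr></table></body></html>'

doc <- htmlParse(html, asText = TRUE)

# Select text nodes inside tables, skipping any that sit under <script> or <style>
txt <- xpathSApply(
  doc,
  "//table//text()[not(ancestor::script) and not(ancestor::style)]",
  xmlValue
)

# Trim surrounding whitespace and drop empty strings
txt <- trimws(txt)
txt <- txt[nzchar(txt)]
print(txt)
```

Filtering in the XPath expression keeps the later regular-expression cleanup limited to whitespace, rather than trying to recognise code after the fact.
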
孤蝉 2024-08-21 02:47:43

Your best bet may be the XML package -- see for example this previous question.
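
As a rough illustration of what the XML package offers, the sketch below parses a small HTML string and collects each link's target and text into a data frame. The markup, the paths in it, and the `links` data frame are hypothetical; for a real page you would download the HTML first.

```r
library(XML)

# Hypothetical markup; a real page would be downloaded first
html <- '<html><body><h1>Headlines</h1>
<a href="/news/a">First story</a> <a href="/news/b">Second story</a></body></html>'

doc <- htmlParse(html, asText = TRUE)

# Link targets and link text, one element per <a> node
hrefs  <- xpathSApply(doc, "//a", xmlGetAttr, "href")
labels <- xpathSApply(doc, "//a", xmlValue)

links <- data.frame(href = hrefs, label = labels, stringsAsFactors = FALSE)
print(links)
```
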

爱殇璃 2024-08-21 02:47:43

I know you asked for R, but maybe Python with BeautifulSoup is the way forward here? You could scrape the page with BeautifulSoup and then do your analysis in R.
