如何在 R 中从抓取的网页中分离单个元素

发布于 2024-09-05 04:09:29 字数 1365 浏览 5 评论 0原文

我想使用 R 来抓取此页面:(http://www.fifa.com/worldcup/archive/germany2006/results/matches/match=97410001/report.html)等,获取进球者和时间。

到目前为止,这就是我所得到的:

require(RCurl)
require(XML)

theURL <-"http://www.fifa.com/worldcup/archive/germany2006/results/matches/match=97410001/report.html"
webpage <- getURL(theURL, header=FALSE, verbose=TRUE) 
webpagecont <- readLines(tc <- textConnection(webpage)); close(tc)  

pagetree <- htmlTreeParse(webpagecont, error=function(...){}, useInternalNodes = TRUE)

并且 pagetree 对象现在包含一个指向我解析的 html 的指针(我认为)。我想要的部分是:

<div class="cont")<ul>
<div class="bold medium">Goals scored</div>
        <li>Philipp LAHM (GER) 6', </li>
        <li>Paulo WANCHOPE (CRC) 12', </li>
        <li>Miroslav KLOSE (GER) 17', </li>
        <li>Miroslav KLOSE (GER) 61', </li>
        <li>Paulo WANCHOPE (CRC) 73', </li>
        <li>Torsten FRINGS (GER) 87'</li>
</ul></div>

但我现在不知道如何隔离它们,坦率地说 xpathSApplyxpathApply 让我感到困惑!

那么,有谁知道如何制定命令来吸出

标签中包含的元素?

I want to use R to scrape this page: (http://www.fifa.com/worldcup/archive/germany2006/results/matches/match=97410001/report.html ) and others, to get the goal scorers and times.

So far, this is what I've got:

require(RCurl)
require(XML)

theURL <-"http://www.fifa.com/worldcup/archive/germany2006/results/matches/match=97410001/report.html"
webpage <- getURL(theURL, header=FALSE, verbose=TRUE) 
webpagecont <- readLines(tc <- textConnection(webpage)); close(tc)  

pagetree <- htmlTreeParse(webpagecont, error=function(...){}, useInternalNodes = TRUE)

and the pagetree object now contains a pointer to my parsed html (I think). The part I want is:

<div class="cont")<ul>
<div class="bold medium">Goals scored</div>
        <li>Philipp LAHM (GER) 6', </li>
        <li>Paulo WANCHOPE (CRC) 12', </li>
        <li>Miroslav KLOSE (GER) 17', </li>
        <li>Miroslav KLOSE (GER) 61', </li>
        <li>Paulo WANCHOPE (CRC) 73', </li>
        <li>Torsten FRINGS (GER) 87'</li>
</ul></div>

But I'm now lost as to how to isolate them, and frankly xpathSApply and xpathApply confuse the beejeebies out of me!

So, does anyone know how to formulate a command to suck out the element contained within the <div class="cont"> tags?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

耳钉梦 2024-09-12 04:09:29

在处理 R 中的网页抓取和 XML 时,这些问题非常有帮助:

  1. 使用 XML 包将 html 表抓取到 R 数据框中
  2. 如何将 XML 数据转换为 data.frame?

对于您的特定示例,虽然我不确定您希望输出是什么样子,但这会得到“得分目标”作为字符向量:

theURL <-"http://www.fifa.com/worldcup/archive/germany2006/results/matches/match=97410001/report.html"
fifa.doc <- htmlParse(theURL)
fifa <- xpathSApply(fifa.doc, "//*/div[@class='cont']", xmlValue)
goals.scored <- grep("Goals scored", fifa, value=TRUE)

xpathSApply 函数获取与给定条件匹配的所有值,并将它们作为向量返回。请注意我如何寻找 class='cont' 的 div。使用类值通常是解析 HTML 文档的好方法,因为它们是很好的标记。

您可以根据需要清理它:

> gsub("Goals scored", "", strsplit(goals.scored, ", ")[[1]])
[1] "Philipp LAHM (GER) 6'"    "Paulo WANCHOPE (CRC) 12'" "Miroslav KLOSE (GER) 17'" "Miroslav KLOSE (GER) 61'" "Paulo WANCHOPE (CRC) 73'"
[6] "Torsten FRINGS (GER) 87'"

These questions are very helpful when dealing with web scraping and XML in R:

  1. Scraping html tables into R data frames using the XML package
  2. How to transform XML data into a data.frame?

With regards to your particular example, while I'm not sure what you want the output to look like, this gets the "goals scored" as a character vector:

theURL <-"http://www.fifa.com/worldcup/archive/germany2006/results/matches/match=97410001/report.html"
fifa.doc <- htmlParse(theURL)
fifa <- xpathSApply(fifa.doc, "//*/div[@class='cont']", xmlValue)
goals.scored <- grep("Goals scored", fifa, value=TRUE)

The xpathSApply function gets all the values that match the given criteria, and returns them as a vector. Note how I'm looking for a div with class='cont'. Using class values is frequently a good way to parse an HTML document because they are good markers.

You can clean this up however you want:

> gsub("Goals scored", "", strsplit(goals.scored, ", ")[[1]])
[1] "Philipp LAHM (GER) 6'"    "Paulo WANCHOPE (CRC) 12'" "Miroslav KLOSE (GER) 17'" "Miroslav KLOSE (GER) 61'" "Paulo WANCHOPE (CRC) 73'"
[6] "Torsten FRINGS (GER) 87'"
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文