How to isolate a single element from a scraped web page in R
I want to use R to scrape this page (http://www.fifa.com/worldcup/archive/germany2006/results/matches/match=97410001/report.html) and others, to get the goal scorers and times.
So far, this is what I've got:
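require(RCurl)
require(XML)
theURL <- "http://www.fifa.com/worldcup/archive/germany2006/results/matches/match=97410001/report.html"
# Download the raw HTML as a single string
webpage <- getURL(theURL, header = FALSE, verbose = TRUE)
# Split the string into a character vector of lines
webpagecont <- readLines(tc <- textConnection(webpage)); close(tc)
# Parse into an internal document that the XPath functions can query
pagetree <- htmlTreeParse(webpagecont, error = function(...) {}, useInternalNodes = TRUE)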
and the pagetree object now contains a pointer to my parsed html (I think). The part I want is:
<div class="cont"><ul>
<div class="bold medium">Goals scored</div>
<li>Philipp LAHM (GER) 6', </li>
<li>Paulo WANCHOPE (CRC) 12', </li>
<li>Miroslav KLOSE (GER) 17', </li>
<li>Miroslav KLOSE (GER) 61', </li>
<li>Paulo WANCHOPE (CRC) 73', </li>
<li>Torsten FRINGS (GER) 87'</li>
</ul></div>
But I'm now lost as to how to isolate them, and frankly xpathSApply and xpathApply confuse the beejeebies out of me!
So, does anyone know how to formulate a command to suck out the elements contained within the <div class="cont"> tags?
These questions are very helpful when dealing with web scraping and XML in R:
With regard to your particular example, while I'm not sure what you want the output to look like, this gets the "goals scored" as a character vector:
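The answer's original code did not survive on this page, so what follows is only a minimal sketch of the kind of call being described, not the answerer's exact code. It assumes the XML package is installed and, so that it runs offline, parses the fragment quoted in the question from an inline string (the name `html` is made up for illustration) instead of fetching the live URL.

```r
library(XML)

# Assumption: the fragment from the question, inlined so the sketch runs offline.
# Attributes use single quotes here only to keep the R string simple.
html <- paste0(
  "<div class='cont'><ul>",
  "<div class='bold medium'>Goals scored</div>",
  "<li>Philipp LAHM (GER) 6', </li>",
  "<li>Paulo WANCHOPE (CRC) 12', </li>",
  "<li>Miroslav KLOSE (GER) 17', </li>",
  "<li>Miroslav KLOSE (GER) 61', </li>",
  "<li>Paulo WANCHOPE (CRC) 73', </li>",
  "<li>Torsten FRINGS (GER) 87'</li>",
  "</ul></div>"
)

# Parse the string into an internal document that XPath can query
doc <- htmlParse(html, asText = TRUE)

# Every <li> anywhere beneath a div whose class attribute is exactly "cont";
# xmlValue extracts the text of each matching node
goals <- xpathSApply(doc, "//div[@class='cont']//li", xmlValue)
print(goals)
```

On the real page there may be several div elements with class "cont", so the XPath would probably need narrowing, for example by filtering the results for the block that contains "Goals scored".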
The xpathSApply function gets all the values that match the given criteria and returns them as a vector. Note how I'm looking for a div with class='cont'. Using class values is frequently a good way to parse an HTML document because they are good markers. You can clean this up however you want:
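The cleanup code is also missing from this page. As one possible way to do it, and not necessarily what the answerer wrote, the sketch below splits each entry into player, team, and minute using base R only. The vector `goals` is typed in by hand from the question's fragment, and the names `pattern`, `parts`, and `scorers` are illustrative.

```r
# Assumption: the text of each <li>, as it appears in the question's fragment
goals <- c(
  "Philipp LAHM (GER) 6', ",
  "Paulo WANCHOPE (CRC) 12', ",
  "Miroslav KLOSE (GER) 17', ",
  "Miroslav KLOSE (GER) 61', ",
  "Paulo WANCHOPE (CRC) 73', ",
  "Torsten FRINGS (GER) 87'"
)

# Capture groups: player name, team code in parentheses, minute before the tick mark
pattern <- "^(.*) \\(([A-Z]+)\\) ([0-9]+)'"
parts <- do.call(rbind, regmatches(goals, regexec(pattern, goals)))

# Column 1 of parts is the full match; columns 2 to 4 are the capture groups
scorers <- data.frame(
  player = parts[, 2],
  team   = parts[, 3],
  minute = as.integer(parts[, 4]),
  stringsAsFactors = FALSE
)
print(scorers)
```

The pattern only handles plain minutes such as 87'. Stoppage-time notations or own-goal and penalty markers, if the site uses them, would need a looser expression.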