Scraping the Wikipedia "Periodic table" page and all its links
I wish to scrape the following wiki article: http://en.wikipedia.org/wiki/Periodic_table
So that the output of my R code will be a table with the following columns:
- The chemical element's short name (symbol)
- The chemical element's full name
- The URL of the chemical element's wiki page
(and with a row for each chemical element, obviously)
I am trying to get at the values inside the page using the XML package, but I seem to be stuck at the beginning, so I'd appreciate an example of how to do it (and/or links to relevant examples).
library(XML)
library(RCurl)   # getURLContent() comes from RCurl, not XML
base_url <- "http://en.wikipedia.org/wiki/Periodic_table"
base_html <- getURLContent(base_url)[[1]]
parsed_html <- htmlTreeParse(base_html, useInternalNodes = TRUE)
xmlChildren(parsed_html)
getNodeSet(parsed_html, "//html", c(x = base_url))

which gives:

[[1]]
attr(,"class")
[1] "XMLNodeSet"
Comments (3)
Try this:
A bit of the output:
We can make this more compact by starting with Jeffrey's xpath expression (since it nearly gets the elements at the top) and adding a qualification to it that gets them exactly. In that case xpathSApply can be used to eliminate the need for do.call or the plyr package. The last bit, where we fix up odds and ends, is the same as before. This produces a matrix rather than a data frame, which seems preferable since the content is entirely character.
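The compact approach described above can be sketched roughly as follows. This is a sketch, not the answer's original code: the xpath `//td/a[@title]` and the restriction to the first 118 matches are assumptions based on the surrounding discussion.

```r
library(RCurl)
library(XML)

base_url <- "http://en.wikipedia.org/wiki/Periodic_table"
doc <- htmlTreeParse(getURLContent(base_url)[[1]], useInternalNodes = TRUE)

# The qualified xpath does the filtering, so xpathSApply can return the
# three pieces of information directly, simplified into a character matrix.
mat <- xpathSApply(doc, "//td/a[@title]", function(a)
  c(symbol = xmlValue(a),
    name   = xmlGetAttr(a, "title"),
    url    = paste0("http://en.wikipedia.org", xmlGetAttr(a, "href"))))

# xpathSApply returns one column per node; transpose and keep the elements.
elements <- t(mat)[1:118, ]
head(elements)
```

Because the result is entirely character data, leaving it as a matrix (rather than converting to a data frame) is a reasonable choice.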
但可惜,这不是您想要的:
名称消失了,原子序数变成了符号。
回到绘图板...
我的 DOM 行走能力不是很强,所以这不太漂亮。它获取表格单元格中的每个链接,仅保留具有“标题”属性的链接(即符号所在的位置),并将您想要的内容粘贴到 data.frame 中。它也获取页面上的所有其他此类链接,但我们很幸运,元素是前 118 个此类链接:
Tal -- I thought this was going to be easy. I was going to point you to readHTMLTable(), my favorite function in the XML package. Heck, its help page even shows an example of scraping a Wikipedia page!
But alas, this is not what you want:
The names are gone and the atomic number runs into the symbol.
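For reference, the readHTMLTable() attempt looks something like this (a sketch; the index of the periodic-table grid among the page's tables is an assumption):

```r
library(XML)

url <- "http://en.wikipedia.org/wiki/Periodic_table"
tables <- readHTMLTable(url, stringsAsFactors = FALSE)

# The main grid collapses badly: each cell holds the atomic number run
# together with the symbol (e.g. "1H"), and the full names are lost.
tables[[1]]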
So back to the drawing board...
My DOM walking-fu is not very strong, so this isn't pretty. It gets every link in a table cell, only keeps those with a "title" attribute (that's where the symbol is), and sticks what you want in a data.frame. It gets every other such link on the page, too, but we're lucky and the elements are the first 118 such links:
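A sketch of that walk, assuming (as the answer says) that the element links are the first 118 title-bearing links inside table cells:

```r
library(RCurl)
library(XML)

doc <- htmlTreeParse(
  getURLContent("http://en.wikipedia.org/wiki/Periodic_table")[[1]],
  useInternalNodes = TRUE)

# every link inside a table cell
links <- getNodeSet(doc, "//td//a")

# keep only those with a "title" attribute (that's where the full name is)
links <- Filter(function(a) !is.null(xmlGetAttr(a, "title")), links)

# stick symbol, name, and URL into a data.frame, one row per link
rows <- lapply(links[1:118], function(a)
  data.frame(symbol = xmlValue(a),
             name   = xmlGetAttr(a, "title"),
             url    = paste0("http://en.wikipedia.org", xmlGetAttr(a, "href")),
             stringsAsFactors = FALSE))
elements <- do.call(rbind, rows)
head(elements)
```

Note that this relies on the page layout keeping the element links first among all title-bearing cell links, so it is fragile against edits to the article.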
Do you have to scrape Wikipedia? You can run this SPARQL query against Wikidata instead (results):
Sorry if this doesn't answer your question directly but this should help people looking to scrape the same information but in a clean manner.
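A sketch of such a query, assuming Wikidata's item Q11344 ("chemical element"), property P246 ("element symbol"), and the standard sitelink pattern for English Wikipedia articles:

```sparql
SELECT ?symbol ?elementLabel ?article WHERE {
  ?element wdt:P31 wd:Q11344 ;   # instance of: chemical element
           wdt:P246 ?symbol .    # element symbol, e.g. "He"
  ?article schema:about ?element ;
           schema:isPartOf <https://en.wikipedia.org/> .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
```

Run against the Wikidata Query Service, this returns one row per element with its symbol, English label, and Wikipedia URL, with no HTML parsing involved.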