Scraping the "Periodic table" wiki page and all its links

I wish to scrape the following wiki article: http://en.wikipedia.org/wiki/Periodic_table

So that the output of my R code will be a table with the following columns:

  • The chemical element's short name (symbol)
  • The chemical element's full name
  • The URL of the chemical element's wiki page

(and with a row for each chemical element, obviously)

I am trying to get to the values inside the page using the XML package, but I seem to be stuck at the very beginning, so I'd appreciate an example of how to do it (and/or links to relevant examples):

library(XML)
library(RCurl)  # getURLContent() comes from RCurl, not XML

base_url <- "http://en.wikipedia.org/wiki/Periodic_table"
base_html <- getURLContent(base_url)[[1]]
parsed_html <- htmlTreeParse(base_html, useInternalNodes = TRUE)
xmlChildren(parsed_html)
getNodeSet(parsed_html, "//html", c(x = base_url))
[[1]]
attr(,"class")
[1] "XMLNodeSet"

千笙结 2024-10-13 13:56:46

Try this:

library(XML)

URL <- "http://en.wikipedia.org/wiki/Periodic_table"
root <- htmlTreeParse(URL, useInternalNodes = TRUE)

# extract attributes and value of all 'a' tags within 3rd table
f <- function(x) c(xmlAttrs(x), xmlValue(x))
m1 <- xpathApply(root, "//table[3]//a", f)
m2 <- suppressWarnings(do.call(rbind, m1))

# extract rows that correspond to chemical symbols
ix <- grep("^[[:upper:]][[:lower:]]{0,2}", m2[, "class"])

m3 <- m2[ix, 1:3]
colnames(m3) <- c("URL", "Name", "Symbol")
m3[,1] <- sub("^", "http://en.wikipedia.org", m3[,1])
m3[,2] <- sub(" .*", "", m3[,2])

A bit of the output:

> dim(m3)
[1] 118   3
> head(m3)
     URL                                      Name        Symbol
[1,] "http://en.wikipedia.org/wiki/Hydrogen"  "Hydrogen"  "H"   
[2,] "http://en.wikipedia.org/wiki/Helium"    "Helium"    "He"  
[3,] "http://en.wikipedia.org/wiki/Lithium"   "Lithium"   "Li"  
[4,] "http://en.wikipedia.org/wiki/Beryllium" "Beryllium" "Be"  
[5,] "http://en.wikipedia.org/wiki/Boron"     "Boron"     "B"   
[6,] "http://en.wikipedia.org/wiki/Carbon"    "Carbon"    "C"   

We can make this more compact by enhancing the xpath expression, starting from Jeffrey's xpath expression (since it nearly gets the element links on its own) and adding a qualification so that it matches them exactly. In that case xpathSApply can be used, which eliminates the need for do.call or the plyr package. The last bit, where we fix up odds and ends, is the same as before. This produces a matrix rather than a data frame, which seems preferable since the content is entirely character.

library(XML)

URL <- "http://en.wikipedia.org/wiki/Periodic_table"
root <- htmlTreeParse(URL, useInternalNodes = TRUE)

# extract attributes and value of all a tags within 3rd table
f <- function(x) c(xmlAttrs(x), xmlValue(x))
M <- t(xpathSApply(root, "//table[3]/tr/td/a[.!='']", f))[1:118,]

# nicer column names, fix up URLs, fix up Mercury.
colnames(M) <- c("URL", "Name", "Symbol")
M[,1] <- sub("^", "http://en.wikipedia.org", M[,1])
M[,2] <- sub(" .*", "", M[,2])

View(M)
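
If a data frame is preferred over the character matrix, a one-line conversion (a small addition to the code above, not part of the original answer) will do:

# optional: convert the character matrix to a data frame
elements <- as.data.frame(M, stringsAsFactors = FALSE)
head(elements)
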
枕梦 2024-10-13 13:56:46

Tal -- I thought this was going to be easy. I was going to point you to readHTMLTable(), my favorite function in the XML package. Heck, its help page even shows an example of scraping a Wikipedia page!

But alas, this is not what you want:

library(XML)
url = 'http://en.wikipedia.org/wiki/Periodic_table'
tables = readHTMLTable(url)  # readHTMLTable() accepts the URL directly

# ... look through the list to find the one you want...

table = tables[3]
table

$`NULL`
   Group #          1     2     3      4      5      6      7      8      9     10     11     12     13      14      15      16      17     18
1   Period       <NA>  <NA>  <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>    <NA>    <NA>    <NA>    <NA>
2        1         1H   2He  <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>    <NA>    <NA>    <NA>    <NA>
3        2        3Li   4Be    5B     6C     7N     8O     9F   10Ne   <NA>   <NA>   <NA>   <NA>   <NA>    <NA>    <NA>    <NA>    <NA>
4        3       11Na  12Mg  13Al   14Si    15P    16S   17Cl   18Ar   <NA>   <NA>   <NA>   <NA>   <NA>    <NA>    <NA>    <NA>    <NA>
5        4        19K  20Ca  21Sc   22Ti    23V   24Cr   25Mn   26Fe   27Co   28Ni   29Cu   30Zn   31Ga    32Ge    33As    34Se    35Br   36Kr
6        5       37Rb  38Sr   39Y   40Zr   41Nb   42Mo   43Tc   44Ru   45Rh   46Pd   47Ag   48Cd   49In    50Sn    51Sb    52Te     53I   54Xe
7        6       55Cs  56Ba     *   72Hf   73Ta    74W   75Re   76Os   77Ir   78Pt   79Au   80Hg   81Tl    82Pb    83Bi    84Po    85At   86Rn
8        7       87Fr  88Ra    **  104Rf  105Db  106Sg  107Bh  108Hs  109Mt  110Ds  111Rg  112Cn  113Uut  114Uuq  115Uup  116Uuh  117Uus  118Uuo
9     <NA>       <NA>  <NA>  <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>    <NA>    <NA>    <NA>    <NA>   <NA>
10       *  Lanthanoids  57La  58Ce  59Pr  60Nd   61Pm   62Sm   63Eu   64Gd   65Tb   66Dy   67Ho   68Er    69Tm    70Yb    71Lu    <NA>   <NA>
11      **    Actinoids  89Ac  90Th  91Pa   92U   93Np   94Pu   95Am   96Cm   97Bk   98Cf   99Es  100Fm   101Md   102No   103Lr    <NA>   <NA>

The names are gone and the atomic number runs into the symbol.

So back to the drawing board...

My DOM walking-fu is not very strong, so this isn't pretty. It gets every link in a table cell, only keeps those with a "title" attribute (that's where the symbol is), and sticks what you want in a data.frame. It gets every other such link on the page, too, but we're lucky and the elements are the first 118 such links:

library(XML)
library(plyr) 

url = 'http://en.wikipedia.org/wiki/Periodic_table'

# don't forget to parse the HTML, doh!

doc = htmlParse(url)

# get every link in a table cell:

links = getNodeSet(doc, '//table/tr/td/a')

# make a data.frame for each node with non-blank text, link, and 'title' attribute:

df = ldply(links, function(x) {
            text = xmlValue(x)
            if (text=='') text=NULL

            symbol = xmlGetAttr(x, 'title')
            link = xmlGetAttr(x, 'href')
            if (!is.null(text) & !is.null(symbol) & !is.null(link))
                data.frame(symbol, text, link)
        } )

# only keep the actual elements -- we're lucky they're first!

df = head(df, 118)

head(df)
     symbol text            link
1  Hydrogen    H  /wiki/Hydrogen
2    Helium   He    /wiki/Helium
3   Lithium   Li   /wiki/Lithium
4 Beryllium   Be /wiki/Beryllium
5     Boron    B     /wiki/Boron
6    Carbon    C    /wiki/Carbon
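
Note that the link column holds relative paths; to get the full URLs the question asked for, one more line (a small addition, not part of the answer above) prepends the host, as in the first answer:

# make the links absolute URLs
df$link = paste0('http://en.wikipedia.org', df$link)
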
命硬 2024-10-13 13:56:46

Do you have to scrape Wikipedia? You can run this SPARQL query against Wikidata instead:

SELECT
  ?elementLabel
  ?symbol
  ?article
WHERE
{
  ?element wdt:P31 wd:Q11344;
           wdt:P1086 ?n;
           wdt:P246 ?symbol.
  OPTIONAL {
    ?article schema:about ?element;
             schema:inLanguage "en";
             schema:isPartOf <https://en.wikipedia.org/>.
  }
  FILTER (?n >= 1 && ?n <= 118).
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
ORDER BY ?n

Sorry if this doesn't answer your question directly but this should help people looking to scrape the same information but in a clean manner.
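
For completeness, here is a minimal R sketch of how one might run that query programmatically; it is my addition, and it assumes the httr and jsonlite packages and the public Wikidata Query Service endpoint at https://query.wikidata.org/sparql:

library(httr)
library(jsonlite)

endpoint <- "https://query.wikidata.org/sparql"
sparql <- '
SELECT ?elementLabel ?symbol ?article WHERE {
  ?element wdt:P31 wd:Q11344;
           wdt:P1086 ?n;
           wdt:P246 ?symbol.
  OPTIONAL {
    ?article schema:about ?element;
             schema:inLanguage "en";
             schema:isPartOf <https://en.wikipedia.org/>.
  }
  FILTER (?n >= 1 && ?n <= 118).
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
ORDER BY ?n'

# request JSON results from the endpoint
resp <- GET(endpoint, query = list(query = sparql, format = "json"))
stop_for_status(resp)
b <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))$results$bindings

# flatten the bindings into the three requested columns
elements <- data.frame(Symbol = b$symbol$value,
                       Name   = b$elementLabel$value,
                       URL    = b$article$value,
                       stringsAsFactors = FALSE)
head(elements)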
