Extracting HTML tables from a website

Posted on 2024-11-05 09:03:55


I am trying to use the XML and RCurl packages to read some HTML tables from the following URL:
http://www.nse-india.com/marketinfo/equities/cmquote.jsp?key=SBINEQN&symbol=SBIN&flag=0&series=EQ#

Here is the code I am using:

library(RCurl)
library(XML)
options(RCurlOptions = list(useragent = "R"))
url <- "http://www.nse-india.com/marketinfo/equities/cmquote.jsp?key=SBINEQN&symbol=SBIN&flag=0&series=EQ#"
wp <- getURLContent(url)
doc <- htmlParse(wp, asText = TRUE) 
docName(doc) <- url
tmp <- readHTMLTable(doc)
## Required tables 
tmp[[13]]
tmp[[14]]

If you look at the resulting tables, the values from the webpage have not been parsed.
I guess this is because the values are filled in by JavaScript that runs on the fly.
However, if I use the "Save page as" option in Google Chrome (it does not work in Mozilla)
to save the page and then run the above code on the saved file, I am able to read the values.
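
For reference, the saved-page workaround described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the file path is hypothetical, and a small stand-in HTML file with made-up values is written first so the sketch runs offline. With a real saved page, the `writeLines()` step would be dropped and the table indices would be 13 and 14 as in the code above.

```r
library(XML)

# Hypothetical location of the page saved from Chrome via "Save page as".
saved_page <- tempfile(fileext = ".html")

# Stand-in for the saved page so this sketch runs offline.
# The field names and numbers are made up for illustration only.
writeLines(c(
  "<html><body>",
  "<table>",
  "<tr><th>Field</th><th>Value</th></tr>",
  "<tr><td>Open</td><td>100.5</td></tr>",
  "<tr><td>Close</td><td>101.2</td></tr>",
  "</table>",
  "</body></html>"
), saved_page)

# Parse the local file instead of the live URL, then extract every <table>.
doc <- htmlParse(saved_page)
tmp <- readHTMLTable(doc)

# The stand-in file contains a single table.
tmp[[1]]
```

Because the saved file already contains the values the browser rendered, no JavaScript has to run in R. That is presumably why this route works while fetching the raw URL does not.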

Is there a workaround that would let me read the tables on the fly?
It would be great if you could help.

Regards,
