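Improving a function that gets stock news data from Google in R
I've written a function to grab and parse news data from Google for a given stock symbol, but I'm sure there are ways it could be improved. For starters, my function returns an object in the GMT timezone rather than the user's current timezone, and it fails if passed a number greater than 299 (probably because Google only returns 300 stories per stock). This is somewhat in response to my own question on Stack Overflow, and relies heavily on this blog post.
tl;dr: how can I improve this function?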
getNews <- function(symbol, number){

    # Warn about length
    if (number>300) {
        warning("May only get 300 stories from google")
    }

    # load libraries
    require(XML); require(plyr); require(stringr); require(lubridate);
    require(xts); require(RDSTK)

    # construct url to news feed rss and encode it correctly
    url.b1 = 'http://www.google.com/finance/company_news?q='
    url    = paste(url.b1, symbol, '&output=rss', "&start=", 1,
                   "&num=", number, sep = '')
    url    = URLencode(url)

    # parse xml tree, get item nodes, extract data and return data frame
    doc   = xmlTreeParse(url, useInternalNodes = TRUE)
    nodes = getNodeSet(doc, "//item")
    mydf  = ldply(nodes, as.data.frame(xmlToList))

    # clean up names of data frame
    names(mydf) = str_replace_all(names(mydf), "value\\.", "")

    # convert pubDate to date-time object and convert time zone
    pubDate = strptime(mydf$pubDate,
                       format = '%a, %d %b %Y %H:%M:%S', tz = 'GMT')
    pubDate = with_tz(pubDate, tz = 'America/New_york')
    mydf$pubDate = NULL

    #Parse the description field
    mydf$description <- as.character(mydf$description)
    parseDescription <- function(x) {
        out <- html2text(x)$text
        out <- strsplit(out,'\n|--')[[1]]

        #Find Lead
        TextLength <- sapply(out,nchar)
        Lead <- out[TextLength==max(TextLength)]

        #Find Site
        Site <- out[3]

        #Return cleaned fields
        out <- c(Site,Lead)
        names(out) <- c('Site','Lead')
        out
    }
    description <- lapply(mydf$description,parseDescription)
    description <- do.call(rbind,description)
    mydf <- cbind(mydf,description)

    #Format as XTS object
    mydf = xts(mydf,order.by=pubDate)

    # drop Extra attributes that we don't use yet
    mydf$guid.text = mydf$guid..attrs = mydf$description = mydf$link = NULL

    return(mydf)
}
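On the timezone point raised in the question, one possible approach is to parse the feed's timestamps as GMT and then change only the display time zone. The sketch below uses base R and is not part of the original function: the helper name `parse_pub_date` and the sample timestamp are made up for illustration, and it assumes the feed reports GMT and that the session locale uses English day and month names.

```r
# Hypothetical helper (not from the original post): parse RSS-style pubDate
# strings as GMT, then re-express the same instants in a target time zone.
parse_pub_date <- function(x, tz = Sys.timezone()) {
  # Sys.timezone() can return NA on some systems; fall back to UTC in that case
  if (is.na(tz) || !nzchar(tz)) tz <- "UTC"

  # Drop a trailing zone label such as " GMT" (assumption: the feed is in GMT)
  x <- sub("\\s+[A-Za-z]+$", "", x)

  # %a and %b are locale dependent; this assumes English names
  out <- as.POSIXct(x, format = "%a, %d %b %Y %H:%M:%S", tz = "GMT")

  # Changing the tzone attribute changes how the instant prints, not the instant
  attr(out, "tzone") <- tz
  out
}

# Made-up timestamp, shown in New York time
stamp <- parse_pub_date("Mon, 04 Jul 2011 14:30:00 GMT", tz = "America/New_York")
print(format(stamp, "%Y-%m-%d %H:%M"))
```

Leaving `tz` at its default would show the times in the time zone of the machine running the code, which may be closer to what the question asks for than a hard-coded zone.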
Comments (1)
Here is a shorter (and probably more efficient) version of your getNews function.
Moreover, there might be a bug in your code, as I tried using it for symbol = 'WMT' and it returned an error. I think getNews2 works fine for WMT too. Check it out and let me know if it works for you.
PS. The description column still contains HTML code, but it should be easy to extract the text from it. I will post an update when I find time.
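As a rough illustration of the text extraction mentioned in the PS, the sketch below strips tags with base R regular expressions. It is an assumption about how this could be done, not the answerer's promised update: the helper name `strip_html` and the sample string are invented, and regex-based stripping is crude (it only decodes two common entities), so a real HTML parser may be a better choice for messy input.

```r
# Hypothetical helper (not from the original answer): crude HTML-to-text
# conversion for a character vector such as a description column.
strip_html <- function(x) {
  x <- gsub("<[^>]+>", " ", x)               # replace anything tag-like with a space
  x <- gsub("&nbsp;", " ", x, fixed = TRUE)  # decode a couple of common entities
  x <- gsub("&amp;", "&", x, fixed = TRUE)
  x <- gsub("[[:space:]]+", " ", x)          # collapse runs of whitespace
  trimws(x)                                  # drop leading/trailing whitespace
}

# Made-up sample resembling an RSS description field
sample_description <- "<div><a href='http://example.com'>Example Site</a><br><b>Wal-Mart</b> reports &amp; more</div>"
print(strip_html(sample_description))
```

Because the function is vectorised, it could be applied directly to a whole column, for example `strip_html(as.character(mydf$description))`, assuming a data frame shaped like the one in the question.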