Page scraping to get prices from Google Finance

Posted 2024-10-31 12:56:48


I am trying to get stock prices by scraping Google Finance pages. I am doing this in Python, using the urllib2 package to fetch the page and then a regex to extract the price data.

When I leave my Python script running, it works for a few minutes and then starts throwing an exception: [HTTP Error 503: Service Unavailable]

I guess this happens because the web server sees the frequent page requests, treats the script as a robot, and starts returning this error after a while.

Is there a way around this, e.g. deleting or creating some cookie?

Even better would be an API from Google. I want to do this in Python because the complete app is in Python, but if nothing is available in Python I can consider alternatives. This is the Python method I call in a loop to get the data (with a few seconds of sleep between calls):

import re
import urllib2

def getPriceFromGOOGLE(self, symbol):
    """
    Gets the last traded price from Google for the given security.
    Returns 0.0 if the price cannot be fetched or parsed.
    """
    toReturn = 0.0
    try:
        base_url = 'http://google.com/finance?q='
        req = urllib2.Request(base_url + symbol)
        content = urllib2.urlopen(req).read()
        # Escape the symbol so characters such as '.' are matched literally
        namestr = 'name:"' + re.escape(symbol) + '",cp:(.*),p:(.*),cid(.*)}'
        m = re.search(namestr, content)
        if m:
            # The second group holds the price, possibly quoted and with commas
            data = str(m.group(2).strip().strip('"'))
            price = data.replace(',', '')
            toReturn = float(price)
        else:
            print 'ERROR ' + str(symbol) + ' --- ' + str(content)
    except Exception, exc:
        print 'Exc: ' + str(exc)
    return toReturn
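
Since the 503 responses show up only after a few minutes of polling, one option is to slow down when they appear instead of retrying at a fixed interval. The following is a minimal Python 3 sketch of exponential backoff, not a guaranteed fix: `fetch_with_backoff` and its parameters are hypothetical names, and `fetch` stands for whatever function downloads the page and raises `urllib.error.HTTPError` on failure.

```python
import time
import urllib.error


def fetch_with_backoff(fetch, url, max_tries=5, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff on HTTP 503.

    fetch is any callable that returns the page body or raises
    urllib.error.HTTPError. This helper is hypothetical, not part of urllib.
    """
    for attempt in range(max_tries):
        try:
            return fetch(url)
        except urllib.error.HTTPError as exc:
            # Retry only on 503; re-raise other codes and the final failure.
            if exc.code != 503 or attempt == max_tries - 1:
                raise
            # Wait base_delay, then twice that, then four times that, ...
            sleep(base_delay * (2 ** attempt))
```

With urllib this could be called as `fetch_with_backoff(lambda u: urllib.request.urlopen(u).read(), url)`, after importing `urllib.request`. Whether the server stops returning 503 after a pause is an assumption that has to be checked against the real site.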


南薇 2024-11-07 12:56:48

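This question is old, but the selected answer no longer works.
The API has been deprecated.

There is an open source project that scrapes all companies from Google Finance and matches them with their current prices, available at http://scrape-google-finance.compunect.com/
The project solves most of the problems, including caching and IP management, and it runs stably without getting blocked.
It uses the internal finance company-matching API to scrape the companies and the chart API to get the prices.
However, it is PHP code, not Python. You can still learn how it solves the task and adapt it.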
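
The caching idea carries over to Python easily. The sketch below is not taken from the PHP project; it only illustrates one way to avoid refetching a symbol that was requested recently. `PriceCache`, `fetch_price`, and the 60-second default are all hypothetical choices.

```python
import time


class PriceCache:
    """Keep the last fetched price per symbol for ttl seconds."""

    def __init__(self, fetch_price, ttl=60.0, clock=time.monotonic):
        self._fetch_price = fetch_price  # callable: symbol -> price
        self._ttl = ttl
        self._clock = clock
        self._entries = {}  # symbol -> (timestamp, price)

    def get(self, symbol):
        now = self._clock()
        entry = self._entries.get(symbol)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh, so no request is made
        price = self._fetch_price(symbol)
        self._entries[symbol] = (now, price)
        return price
```

Wrapping the existing price function, for example `cache = PriceCache(get_price)` followed by `cache.get('GOOG')`, means repeated lookups inside the time window cost no extra requests.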


舟遥客 2024-11-07 12:56:48


To get around most rate limiting or bot detection from the likes of Google, Wikipedia, or Yahoo, spoof your User-Agent header.

This will make your script's requests appear to come from a Google Chrome browser (Chrome 11 in this example).

import urllib2

# url is the page to fetch, e.g. the Google Finance URL from the question
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.0; WOW64) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.16 Safari/534.24"}
req = urllib2.Request(url, None, headers)
content = urllib2.urlopen(req).read()
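
In Python 3, urllib2 was merged into `urllib.request`, so the same idea looks slightly different there. This is a minimal sketch under that assumption; `build_request` and `fetch` are hypothetical helper names, and the User-Agent string is simply the one from the snippet above.

```python
import urllib.request

# Example User-Agent string; any browser-like string is set the same way.
USER_AGENT = ("Mozilla/5.0 (Windows NT 6.0; WOW64) AppleWebKit/534.24 "
              "(KHTML, like Gecko) Chrome/11.0.696.16 Safari/534.24")


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Fetch url with the custom User-Agent and return the raw bytes."""
    with urllib.request.urlopen(build_request(url)) as response:
        return response.read()
```

Keeping the request construction separate from the network call makes it possible to check the header without contacting the server.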


街角卖回忆 2024-11-07 12:56:48


Yahoo Finance is also a good source of financial information, and it covers more countries and stocks.

For Python 2, you can use ystockquote. For Python 3, you can use yfq, which I rewrote from ystockquote.

To get the current quotes for Google and Intel:

>>> import yfq
>>> yfq.get_price('GOOG+INTL')
{'GOOG': '600.25', 'INTL': '22.25'}

To get the historical quotes for Yahoo from March 1, 2012 to March 3, 2012:

>>> yfq.get_historical_prices('YHOO','20120301','20120303')
[['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close'], ['2012-03-02', '14.89', '14.92', '14.66', '14.72', '9164900', '14.72'], ['2012-03-01', '14.89', '14.96', '14.79', '14.93', '12283300', '14.93']]
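
The historical result is a list of rows with the column names in the first row and every value as a string. As a small sketch of how that shape could be post-processed, the hypothetical helper `rows_to_dicts` below (not part of yfq) turns the rows into dictionaries, using the output shown above as sample data.

```python
def rows_to_dicts(rows):
    """Turn [header, row, row, ...] into a list of dicts keyed by column name."""
    header, *data = rows
    return [dict(zip(header, row)) for row in data]


# Sample data copied from the get_historical_prices output above.
rows = [
    ['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close'],
    ['2012-03-02', '14.89', '14.92', '14.66', '14.72', '9164900', '14.72'],
    ['2012-03-01', '14.89', '14.96', '14.79', '14.93', '12283300', '14.93'],
]
quotes = rows_to_dicts(rows)

# Prices arrive as strings, so convert them before doing arithmetic.
closes = {q['Date']: float(q['Close']) for q in quotes}
```

After this, `closes['2012-03-02']` can be used as a number without further parsing.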
