Scraping Google Finance pages to get stock prices

Posted 2024-10-31 12:56:48

I am trying to get stock prices by scraping Google Finance pages. I am doing this in Python, using the urllib2 package to fetch the page and a regular expression to extract the price.

When I leave my Python script running, it works for a few minutes and then starts throwing an exception: [HTTP Error 503: Service Unavailable]

I guess this happens because the web server detects the frequent page requests as coming from a robot and starts rejecting them after a while.

Is there a way around this, for example by deleting or creating some cookie?

Even better would be an API from Google. I want to do this in Python because the complete app is written in Python, but if nothing is available in Python I can consider alternatives. This is the Python method I call in a loop to get the data (with a few seconds of sleep between calls):

import re
import urllib2  # Python 2 standard library

def getPriceFromGOOGLE(self, symbol):
    """
    Gets the last traded price from Google for the given security.
    Returns 0.0 if the price cannot be fetched or parsed.
    """
    toReturn = 0.0
    try:
        base_url = 'http://google.com/finance?q='
        req = urllib2.Request(base_url + symbol)
        content = urllib2.urlopen(req).read()
        # Match name:"SYMBOL",cp:...,p:...,cid...} in the page; group 2 is the price
        namestr = 'name:\"' + symbol + '\",cp:(.*),p:(.*),cid(.*)}'
        m = re.search(namestr, content)
        if m:
            data = str(m.group(2).strip().strip('"'))
            price = data.replace(',', '')  # drop thousands separators
            toReturn = float(price)
        else:
            print 'ERROR ' + str(symbol) + ' --- ' + str(content)
    except Exception, exc:
        print 'Exc: ' + str(exc)
    finally:
        return toReturn

Comments (4)

南薇 2024-11-07 12:56:48

The question is quite old, and the selected answer is no longer valid: the API has been deprecated.

There is an open source project that scrapes all companies from Google Finance and matches them with their current prices: http://scrape-google-finance.compunect.com/
The project solves most of these issues, including caching and IP management, and runs stably without getting blocked.
It uses the internal finance company-matching API to scrape the companies and the chart API to get the prices.
However, it is PHP code, not Python. You can still study how it solves these tasks and adapt the approach.
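
The caching idea mentioned above could look roughly like this in Python. This is only a sketch and is not taken from the PHP project: `PriceCache`, `fetch`, `ttl_seconds`, and `clock` are hypothetical names, and the function that actually fetches a price is supplied by the caller. The point is just to reuse a recently fetched price instead of hitting the server on every loop iteration.

```python
import time


class PriceCache:
    """Reuse a fetched price for ttl_seconds before fetching again (hypothetical helper)."""

    def __init__(self, fetch, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch      # callable: symbol -> price
        self._ttl = ttl_seconds
        self._clock = clock      # injectable, so the cache can be tested without sleeping
        self._entries = {}       # symbol -> (timestamp, price)

    def get(self, symbol):
        now = self._clock()
        entry = self._entries.get(symbol)
        # Serve the cached price while it is younger than the TTL.
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        price = self._fetch(symbol)
        self._entries[symbol] = (now, price)
        return price
```

With something like this, a polling loop could call `cache.get(symbol)` as often as it wants, while the server sees at most one request per symbol per `ttl_seconds`.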

舟遥客 2024-11-07 12:56:48

To get around most rate limiting or bot detection on sites such as Google, Wikipedia, or Yahoo, spoof your user agent.

The header below makes your script's requests appear to come from Google Chrome (version 11 in this example) rather than from a Python script.

import urllib2

# url is the page to fetch, e.g. base_url + symbol from the question
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.0; WOW64) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.16 Safari/534.24"}
req = urllib2.Request(url, None, headers)
content = urllib2.urlopen(req).read()
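
On Python 3, urllib2 no longer exists; its functionality lives in urllib.request and urllib.error. A minimal sketch of the same idea there, with one addition that is my own assumption rather than part of the answer: waiting progressively longer after each HTTP 503 before retrying. `build_request`, `fetch_with_backoff`, and their parameters are hypothetical names, and the User-Agent string is the same example one used above.

```python
import time
import urllib.error
import urllib.request

# Same example User-Agent string as in the answer above.
USER_AGENT = ("Mozilla/5.0 (Windows NT 6.0; WOW64) AppleWebKit/534.24 "
              "(KHTML, like Gecko) Chrome/11.0.696.16 Safari/534.24")


def build_request(url):
    """Return a Request carrying the browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_with_backoff(url, retries=3, base_delay=2.0,
                       opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch url, waiting longer after each HTTP 503 (hypothetical helper)."""
    for attempt in range(retries + 1):
        try:
            with opener(build_request(url)) as response:
                return response.read()
        except urllib.error.HTTPError as exc:
            # Only 503 is retried; other errors and the final failure propagate.
            if exc.code != 503 or attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)
```

Whether a given site accepts this is up to the site; the backoff only makes the script less aggressive when the server starts refusing requests.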

街角卖回忆 2024-11-07 12:56:48

Yahoo Finance is also a good place to get financial information, and it covers more countries and stocks.

For Python 2, you can use ystockquote. For Python 3, you can use yfq, which I rewrote from ystockquote.

To get the current quotes for Google and Intel:

>>> import yfq
>>> yfq.get_price('GOOG+INTL')
{'GOOG': '600.25', 'INTL': '22.25'}

To get the historical quotes for Yahoo from March 1, 2012 to March 3, 2012:

>>> yfq.get_historical_prices('YHOO','20120301','20120303')
[['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close'], ['2012-03-02', '14.89', '14.92', '14.66', '14.72', '9164900', '14.72'], ['2012-03-01', '14.89', '14.96', '14.79', '14.93', '12283300', '14.93']]
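
The historical result is a list of rows of strings, with the column names in the first row. A small sketch of turning it into dictionaries with numeric values; `rows_to_records` is a hypothetical helper, not part of yfq, and the sample rows are copied from the output above so the snippet runs without the library or a network connection.

```python
def rows_to_records(rows):
    """Turn [header, row, row, ...] into a list of dicts with numeric fields."""
    header, *data = rows
    records = []
    for row in data:
        record = dict(zip(header, row))
        for key in ("Open", "High", "Low", "Close", "Adj Close"):
            record[key] = float(record[key])
        record["Volume"] = int(record["Volume"])
        records.append(record)
    return records


# Sample rows copied from the output shown above.
sample = [
    ['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close'],
    ['2012-03-02', '14.89', '14.92', '14.66', '14.72', '9164900', '14.72'],
    ['2012-03-01', '14.89', '14.96', '14.79', '14.93', '12283300', '14.93'],
]
records = rows_to_records(sample)
print(records[0]["Date"], records[0]["Close"])  # prints: 2012-03-02 14.72
```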