Python script to check whether a web page exists without downloading the whole page?

I'm trying to write a script to test for the existence of a web page; it would be nice if it could check without downloading the whole page.

This is my jumping-off point. I've seen multiple examples use httplib in the same way; however, every site I check simply returns False.

import httplib
from httplib import HTTP
from urlparse import urlparse

def checkUrl(url):
    p = urlparse(url)
    h = HTTP(p[1])
    h.putrequest('HEAD', p[2])
    h.endheaders()
    return h.getreply()[0] == httplib.OK

if __name__=="__main__":
    print checkUrl("http://www.stackoverflow.com") # True
    print checkUrl("http://stackoverflow.com/notarealpage.html") # False

Any ideas?

Edit

Someone suggested this, but their post was deleted... does urllib2 avoid downloading the whole page?

import urllib2

def checkUrl(some_url):
    try:
        urllib2.urlopen(some_url)
        return True
    except urllib2.URLError:
        return False

Comments (4)

花开雨落又逢春i 2024-11-24 01:52:50

How about this?

import requests

def url_check(url):
    """Boolean return - check to see if the site exists.

    This function takes a URL as input, requests only the site's
    head - not the full HTML - and then checks whether the response
    status code is less than 400. If it is, it returns True;
    otherwise it returns False.
    """
    try:
        site_ping = requests.head(url)
        if site_ping.status_code < 400:
            # To view the returned status code: print(site_ping.status_code)
            return True
        else:
            return False
    except Exception:
        return False
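
A quick usage sketch (example.com is just an illustrative URL):

print(url_check("http://www.example.com"))               # True
print(url_check("http://www.example.com/no-such-page"))  # False

Note that requests does not follow redirects for HEAD requests by default, so a page behind a 301/302 still counts as existing as long as the redirect status itself is below 400; pass allow_redirects=True to requests.head if you want the final status instead.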
☆獨立☆ 2024-11-24 01:52:50

You can try

import urllib2

try:
    urllib2.urlopen(url='https://someURL')
except urllib2.URLError:
    print("page not found")
堇年纸鸢 2024-11-24 01:52:49

How about this:

import httplib
from urlparse import urlparse

def checkUrl(url):
    p = urlparse(url)
    conn = httplib.HTTPConnection(p.netloc)
    conn.request('HEAD', p.path)
    resp = conn.getresponse()
    return resp.status < 400

if __name__ == '__main__':
    print checkUrl('http://www.stackoverflow.com') # True
    print checkUrl('http://stackoverflow.com/notarealpage.html') # False

This will send an HTTP HEAD request and return True if the response status code is < 400.

  • Note that StackOverflow's root path returns a redirect (301), not a 200 OK.
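
On Python 3, where httplib and urlparse were renamed to http.client and urllib.parse, a minimal equivalent sketch (example.com URLs are illustrative):

from http.client import HTTPConnection
from urllib.parse import urlparse

def checkUrl(url):
    # HEAD asks for headers only, so the body is never transferred.
    # Use HTTPSConnection instead for https:// URLs.
    p = urlparse(url)
    conn = HTTPConnection(p.netloc)
    conn.request('HEAD', p.path or '/')
    resp = conn.getresponse()
    conn.close()
    return resp.status < 400

if __name__ == '__main__':
    print(checkUrl('http://www.example.com'))               # typically True
    print(checkUrl('http://www.example.com/no-such-page'))  # typically False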
源来凯始玺欢你 2024-11-24 01:52:49

Using requests, this is as simple as:

import requests

ret = requests.head('http://www.example.com')
print(ret.status_code)

This just loads the website's headers. To test whether this was successful, you can check the result's status_code, or use the raise_for_status method, which raises an exception if the request was not successful.
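
For instance, a small wrapper around raise_for_status (url_exists is a hypothetical helper; the timeout is optional but avoids hanging on unreachable hosts):

import requests

def url_exists(url):
    try:
        ret = requests.head(url, timeout=5)
        ret.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
        return True
    except requests.RequestException:
        return False

print(url_exists("http://www.example.com"))  # typically True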
