Python urllib2 and [errno 10054] An existing connection was forcibly closed by the remote host, and a few urllib2 problems



I've written a crawler that uses urllib2 to fetch URLs.

Every few requests I get some weird behavior. I've tried analyzing it with Wireshark but couldn't understand the problem.

getPAGE() is responsible for fetching the URL. It returns the content of the URL (response.read()) if it fetches the URL successfully; otherwise it returns None.

from urllib2 import Request, urlopen, HTTPError, URLError
import time

def getPAGE(FetchAddress):
    attempts = 0
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0'}
    while attempts < 2:
        req = Request(FetchAddress, None, headers)
        try:
            response = urlopen(req) #fetching the url
        except HTTPError, e:
            print 'The server didn\'t do the request.'
            print 'Error code: ', str(e.code) + "  address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        except URLError, e:
            print 'Failed to reach the server.'
            print 'Reason: ', str(e.reason) + "  address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        except Exception, e:
            print 'Something bad happened in getPAGE.'
            print 'Reason: ', str(e.reason) + "  address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        else:
            return response.read()
    return None
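One thing to note while reading getPAGE(): urlopen() is called without an explicit timeout, so a connection that the server silently stops servicing can block until the OS gives up, which is relevant to the stuck-for-20-minutes behavior described below. For reference, here is a minimal sketch (assuming Python 2.6+, where urllib2.urlopen accepts a timeout argument; fetch_with_timeout is a hypothetical helper, not part of my crawler) of a timeout-aware fetch:

import socket
from urllib2 import Request, urlopen, URLError

def fetch_with_timeout(address, timeout_seconds=30):
    # hypothetical helper, only to illustrate the timeout argument;
    # the timeout applies to blocking socket operations (connect and read),
    # so a stalled server raises an error instead of hanging the loop
    req = Request(address, None, {'User-Agent': 'Mozilla/5.0'})
    try:
        response = urlopen(req, timeout=timeout_seconds)
        return response.read()
    except socket.timeout:
        print 'Timed out after', timeout_seconds, 'seconds:', address
        return None
    except URLError, e:
        # a timeout during connect is usually reported as a URLError
        print 'Failed to reach the server:', str(e.reason), 'address:', address
        return None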

This is the function that calls getPAGE() and checks whether the page I've fetched is valid (the check is companyID = soup.find('span',id='lblCompanyNumber').string; if companyID is None the page is not valid). If the page is valid, it saves the soup object to a global variable named 'curRes'.

def isValid(ID):
    global curRes
    try:
        address = urlPath+str(ID)
        page = getPAGE(address)
        if page == None:
            saveToCsv(ID, badRequest = True)
            return False
    except Exception, e:
        print "An error occured in the first Exception block of parseHTML : " + str(e) +' address: ' + address
    else:
        try:
            soup = BeautifulSoup(page)
        except TypeError, e:
            print "An error occured in the second Exception block of parseHTML : " + str(e) +' address: ' + address
            return False
        try:
            companyID = soup.find('span',id='lblCompanyNumber').string
            if (companyID == None): #if lblCompanyNumber is None we can assume that we don't have the content we want, save in the bad log file
                saveToCsv(ID, isEmpty = True)
                return False
            else:
                curRes = soup #we have the data we need, save the soup obj to a global variable
                return True
        except Exception, e:
            print "Error while parsing this page, third exception block: " + str(e) + ' id: ' + address
            return False

The strange behaviors are:

  1. There are times when urllib2 executes a GET request and, without waiting for the reply, sends the next GET request (ignoring the last one).
  2. Sometimes I get "[errno 10054] An existing connection was forcibly closed by the remote host" after the code has simply been stuck for about 20 minutes waiting for a response from the server. While it is stuck I copy the URL and try to fetch it manually, and I get a response in less than a second (?).
  3. The getPAGE() function returns None to isValid() if it failed to fetch the URL, yet sometimes I get the error:

Error while parsing this page, third exception block: 'NoneType'
object has no attribute 'string' id:....

That's weird, because I only create the soup object if I got a valid result from getPAGE(), and yet it seems that the soup call is returning None, which raises an exception whenever I try to run

companyID = soup.find('span',id='lblCompanyNumber').string

The soup object should never be None; if execution reaches that part of the code, it should have received the HTML from getPAGE().
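To be precise about where the message comes from: 'NoneType' object has no attribute 'string' is what Python raises when find() returns None and .string is accessed on that result, as in this standalone sketch (made-up HTML, assuming BeautifulSoup 3); what I don't understand is how a page that got past the page == None check can end up without that span.

from BeautifulSoup import BeautifulSoup   # assuming BeautifulSoup 3 here

html = "<html><body><p>no company span here</p></body></html>"   # made-up sample markup
soup = BeautifulSoup(html)                         # soup itself is NOT None
tag = soup.find('span', id='lblCompanyNumber')
print tag          # None - the span just isn't in the markup
print tag.string   # AttributeError: 'NoneType' object has no attribute 'string'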

I've checked, and the problem seems somehow connected to the first one (sending a GET and not waiting for the reply). On Wireshark I saw that each time I got that exception, it was for a URL for which urllib2 sent a GET request but moved on without waiting for the response. getPAGE() should have returned None for that URL, but if it had returned None, isValid(ID) wouldn't get past the "if page == None:" condition. I can't figure out why this is happening, and it's impossible to replicate the issue.

I've read that time.sleep() can cause issues with urllib2 threading, so maybe I should avoid using it?

Why doesn't urllib2 always wait for the response (it only rarely fails to wait)?

What can I do about the "[errno 10054] An existing connection was forcibly closed by the remote host" error?
BTW, the exception isn't caught by getPAGE()'s try/except block; it is caught by the first isValid() try/except block, which is also weird, because getPAGE() is supposed to catch every exception it throws:

try:
    address = urlPath+str(ID)
    page = getPAGE(address)
    if page == None:
        saveToCsv(ID, badRequest = True)
        return False
except Exception, e:
    print "An error occured in the first Exception block of parseHTML : " + str(e) +' address: ' + address

Thanks!
