pyspider timeout problem

Posted on 2022-09-11 18:29:15

Problem description

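This is a project that crawls the Shanghai Stock Exchange (SSE) site. I can guarantee that it ran without any problems two days ago: after debugging was finished I switched it to the RUNNING status, and it ran for two days with no issues. Later I switched it to STOP in order to change the database configuration parameters. After making the change and running it again, the index_page step hangs until it finally fails with a timeout error. The error is as follows: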
[E 190218 15:56:31 base_handler:203] HTTP 599: Operation timed out after 121001 milliseconds with 0 bytes received
...
Exception: HTTP 599: Operation timed out after 121001 milliseconds with 0 bytes received
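
One way to narrow down an HTTP 599 like the one above is to request the same page outside pyspider with a short timeout. The 121001 ms in the log appears to match pyspider's default fetch timeout of 120 seconds, so the fetcher seems to have waited the full period without receiving any data. The sketch below uses only the standard library. `probe` is a hypothetical helper name, not part of pyspider, and a plain request does not exercise the fetch_type="js" rendering path, so a success here would only suggest that the site itself is reachable from this machine.

```python
import socket
import time
import urllib.error
import urllib.request


def probe(url, timeout=10.0):
    """Fetch url once and return (ok, detail, elapsed_seconds).

    Hypothetical diagnostic helper; it never raises for ordinary
    network failures, it reports them in `detail` instead.
    """
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read()
            ok = True
            detail = "HTTP %d, %d bytes" % (resp.status, len(body))
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status code.
        ok, detail = False, "HTTP %d" % exc.code
    except (urllib.error.URLError, socket.timeout, OSError) as exc:
        # DNS failure, refused connection, or no answer within `timeout`.
        ok, detail = False, "%s: %s" % (type(exc).__name__, exc)
    return ok, detail, time.monotonic() - started
```

Calling `probe('http://www.sse.com.cn/disclosure/listedinfo/regular/')` from the machine that runs pyspider would show whether a plain request also stalls. If it returns quickly with a 200, the JS rendering step would be the next thing to look at.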

Environment in which the problem occurs, and what I have tried

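The Python version is 3.4. I am certain that the modified database configuration parameters are not the problem (in fact, after rolling the change back it still hangs and times out).
I have also tried removing all of the database-related code completely.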

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2019-02-11 14:55:32
# Project: ShangJiaoSuo_AnnualReport

import urllib.request
import urllib
import re
import os
import stat
import time
import shutil
import MySQLdb
from pyspider.libs.base_handler import *

class Handler(BaseHandler):

    # Main site address (Shanghai Stock Exchange)
    global from_site
    from_site = '上交所(http://www.sse.com.cn)'
    # Address of the page being crawled
    global curr_site
    curr_site = 'http://www.sse.com.cn/disclosure/listedinfo/regular/'

    '''
    Function: initialize the database connection
    '''
    def __init__(self):
        # Database connection settings
        hosts =
        username =
        password =
        database = 'scrmdata'
        charsets = 'utf8'

        self.db = MySQLdb.connect(hosts, username, password, database, charset='utf8')

    '''
    Function: initial entry point
    '''
    @every(minutes=24 * 60)
    def on_start(self):
        # Set the URL to crawl here; fetch_type="js" is set specifically for content rendered by JavaScript
        self.crawl(curr_site, fetch_type="js", callback=self.index_page)

    '''
    Function: index page; find the key tags to crawl from the index page
    '''
    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('.list a').items():
            #print(each.attr.title)
            # Use the save parameter to pass the title to the detail-page callback
            c_title = each.text().split(':')[1]
            self.crawl(each.attr.href, callback=self.detail_page, save={'title': c_title})

    '''
    Function: crawl a detail page
    '''
    @config(priority=2)
    def detail_page(self, response):
        '''
        First crawl the page content, then parse it into the required data
        '''
        # Split the URL path to get the file name, publication date, label type and company stock code
        file_name = response.url.split('/')[-1]

        msg_date = response.url.split('/')[-2]
        label = file_name.split('_')[-1].split('.')[-2]
        stock_code = file_name.split('_')[-3]

        # Prepare the other data that needs to be stored
        title = response.save['title']
        msg_url = response.url

        # Check whether the local target path exists; create it first if it does not
        isExists = os.path.exists(curr_path)
        if not isExists:
            os.makedirs(curr_path)
            print('create ' + curr_path + ' folder successfully')

        # Save the PDF file locally
        pdf_url = response.url
        #urllib.request.urlretrieve(pdf_url, curr_path + file_name)
        print("successful to download to " + curr_path + file_name)

        # Save the crawled and parsed data to the database and update the label record table accordingly
        #self.insert_sql()
        print("The whole process for current data has been executed successfully")

        # Return the crawled information
        return {
            "title": title,
            "stock_code": stock_code,
        }
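
To make the URL splitting in detail_page easier to follow, here is a small standalone sketch that applies the same split calls to a single URL. The function name `parse_report_url` and the example URL are hypothetical, chosen only to show a file name of the form code_year_label.pdf; they are not taken from the real site.

```python
def parse_report_url(url):
    """Apply the same splits as detail_page to one URL (illustrative helper)."""
    file_name = url.split('/')[-1]                    # last path segment
    msg_date = url.split('/')[-2]                     # directory just above the file
    label = file_name.split('_')[-1].split('.')[-2]   # last '_' part, without the extension
    stock_code = file_name.split('_')[-3]             # third '_' part counted from the end
    return {
        'file_name': file_name,
        'msg_date': msg_date,
        'label': label,
        'stock_code': stock_code,
    }


# Hypothetical URL, assumed only for illustration.
example = 'http://example.com/reports/2019-02-16/600000_2018_n.pdf'
print(parse_report_url(example))
```

For this example the date is '2019-02-16', the label is 'n' and the stock code is '600000'. A file name with fewer than three underscore-separated parts raises IndexError at the stock_code line, so it may be worth checking whether every link found in index_page follows this naming pattern.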