Scraping novel pages: how do I handle the encoding problem?

Posted 2022-09-03 07:59:32 | 1548 characters | 11 views | 0 comments

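I'm scraping the pages of a novel (the links are in the code). Each page is turned into a dict and written to a txt file. I can't figure out how to handle the encoding: the Chinese text doesn't display correctly. I spent the whole evening on it and still can't get it to work.
A second problem: the page source often inserts <u>----</u> tags inside the <p> tags, so some characters are missing from the scraped text. I'd appreciate help with that as well.
I've only just started learning Python and would be grateful for any pointers. Thank you for your time.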

# coding:utf-8
from lxml import etree
from multiprocessing.dummy import Pool as ThreadPool
import requests
import sys

reload(sys)

sys.setdefaultencoding('utf-8')

def towrite(contentdict):
    f.writelines(u'章：' + unicode(contentdict['title']) + '\n')
    f.writelines(unicode(contentdict['content']) + '\n\n')

def spider(url):
    head = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
    html = requests.get(url,headers = head)
    html.encoding = 'utf-8'
    print html.text
    seletor = etree.HTML(html.text)
    title = seletor.xpath('//div[@id="f_title1"]/h1/text()')
    content = seletor.xpath('//div[@id="f_content1"]/div/p/text()')
    fullcontent = ''
    for each in content:
        fullcontent += each
    # title.decode('gb2312').encode('utf-8')
    # fullcontent.encode('utf-8')
    item = {}
    item['title'] = title
    item['content'] = fullcontent
    towrite(item)

if __name__ == '__main__':
    pool = ThreadPool(4)
    f = open('guduliujiang.txt','a')
    page = []
    for i in range(1,29):
        newpage = 'http://www.sbkk8.cn/mingzhu/zhongguoxiandaiwenxuemingzhu/guduliujiang/' + str(145232-i) + '.html'
        page.append(newpage)

    results = pool.map(spider,page)
    pool.close()
    pool.join()
    f.close()


终难遇 2022-09-10 07:59:32

The page source is encoded in gb2312, not utf-8.
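
One way to check which encoding a page declares is to look for the charset in its <meta> tag before decoding. Here is a minimal Python 3 sketch using only the standard library. The helper names declared_charset and decode_page and the sample bytes are made up for illustration; with requests, the raw bytes would come from html.content. Widening gb2312/gbk to gb18030 is an assumption on my part, since pages labelled gb2312 sometimes contain characters outside that set.

```python
import re

# Hypothetical helper pattern: finds "charset=..." in the raw HTML bytes.
META_CHARSET = re.compile(rb'charset\s*=\s*["\']?\s*([A-Za-z0-9_\-]+)', re.IGNORECASE)


def declared_charset(raw, default="utf-8"):
    """Return the charset named in the first 2 KB of the page, or `default`."""
    match = META_CHARSET.search(raw[:2048])
    return match.group(1).decode("ascii").lower() if match else default


def decode_page(raw):
    """Decode raw page bytes using the charset the page declares."""
    charset = declared_charset(raw)
    # Assumption: gb18030 is a superset of gbk and gb2312, so it is the
    # safer codec for pages that declare either of them.
    if charset in ("gb2312", "gbk"):
        charset = "gb18030"
    return raw.decode(charset, errors="replace")


# Stand-in for the bytes of a gb2312 page (sample data, not the real site).
sample = (
    '<html><head><meta http-equiv="Content-Type" '
    'content="text/html; charset=gb2312"></head>'
    '<body><h1>孤独六讲</h1></body></html>'
).encode("gb2312")

print(declared_charset(sample))
print("孤独六讲" in decode_page(sample))
```

If the declared charset turns out to be gb2312, forcing utf-8 as in the question cannot produce readable text.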


二智少女猫性小仙女 2022-09-10 07:59:32

Use requests together with BeautifulSoup.


倾其所爱 2022-09-10 07:59:32

Problem 1:
print html.encoding reports the encoding as ISO-8859-1.
print html.text.decode('ISO-8859-1') has no effect.
The following code prints the Chinese characters correctly in the terminal:

html = requests.get(url,headers = head)
print html.encoding
#print html.headers['content-type']
print html.text.encode('latin-1').decode('gbk')
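
Why the encode('latin-1').decode('gbk') round trip works can be shown without any network access. Below is a small Python 3 sketch with made-up sample text. My understanding is that requests falls back to ISO-8859-1 when the Content-Type header is text/* with no charset, which would explain the value printed by html.encoding. Latin-1 maps every byte to exactly one character, so the garbled text still holds the original bytes.

```python
# Bytes as the server would send them: GBK-encoded Chinese (sample text only).
raw = "孤独六讲".encode("gbk")

# Decoding as latin-1 never fails: each byte becomes one character,
# so the result is garbled but no information is lost.
garbled = raw.decode("latin-1")

# Encoding back to latin-1 restores the original bytes exactly;
# decoding those bytes as GBK then yields the real text.
recovered = garbled.encode("latin-1").decode("gbk")

print(len(raw), len(garbled))
print(recovered == "孤独六讲")
```

With requests, setting html.encoding = 'gbk' before reading html.text, or decoding html.content directly, should give the same result without the round trip.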

Problem 2:
Taking your novel page (情慾孤独(2)) as an example:

content = seletor.xpath('//div[@id="f_content1"]')[0]
real_content=content.xpath('string(.)')
print real_content

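See whether this gives you the result you want. string(.) returns the text of the node together with the text of all its descendants, so characters wrapped in <u> tags are kept, whereas p/text() only selects the text nodes that are direct children of <p>.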
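
The difference between direct text and full text can be reproduced on a tiny fragment. This Python 3 sketch uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages. The fragment and the helper names direct_text and full_text are invented for illustration. lxml elements also provide itertext(), so the same idea should carry over.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed fragment imitating the structure described in the question.
fragment = (
    '<div id="f_content1"><div>'
    '<p>孤独<u>是一种</u>状态</p>'
    '<p>第二<u>段</u></p>'
    '</div></div>'
)
root = ET.fromstring(fragment)


def direct_text(p):
    """Text nodes that are direct children of p (what p/text() selects)."""
    parts = [p.text or ""]
    parts.extend(child.tail or "" for child in p)
    return "".join(parts)


def full_text(p):
    """All text inside p, nested tags included (similar to string(.))."""
    return "".join(p.itertext())


for p in root.iter("p"):
    print(direct_text(p), "|", full_text(p))
```

The first column drops whatever sits inside <u>, which matches the missing characters described in the question.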

Below is the test code I put together based on your code:

#-*-coding:utf-8-*-
from lxml import etree
from multiprocessing.dummy import Pool as ThreadPool
import requests
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

def towrite(contentdict):
    f.writelines(u'章：' + unicode(contentdict['title']) + '\n')
    f.writelines(unicode(contentdict['content']) + '\n\n')

def spider(url):
    head = {'User-Agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36'}
    html = requests.get(url,headers = head)
    print html.encoding
    print html.headers['content-type']
    print html.text.encode('latin-1').decode('gbk')
    context=html.text.encode('latin-1').decode('gbk')
    seletor = etree.HTML(context)
#   title = seletor.xpath('//div[@id="f_title1"]/h1/text()')
    content = seletor.xpath('//div[@id="f_content1"]')[0]
    real_content=content.xpath('string(.)')
    print real_content
#    fullcontent = ''
#    for each in content:
#        fullcontent += each
    # title.decode('gb2312').encode('utf-8')
    # fullcontent.encode('utf-8')
#    item = {}
#    item['title'] = title
#    item['content'] = fullcontent
#    towrite(item)

if __name__ == '__main__':
#    pool = ThreadPool(4)
#    f = open('guduliujiang.txt','a')
#    page = []
#    for i in range(1,29):
    newpage = 'http://www.sbkk8.cn/mingzhu/zhongguoxiandaiwenxuemingzhu/guduliujiang/145229.html'
#        page.append(newpage)

#    results = pool.map(spider,page)
#    pool.close()
#    pool.join()
#    f.close()
    spider(newpage)
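
To get from this test back to the original goal of writing the chapters to a txt file, one option is to have each worker return its result and write everything from the main thread with an explicit file encoding. This is a Python 3 sketch, not a drop-in replacement. fetch_chapter is a placeholder that stands in for the real download and parsing, and save_chapters is a hypothetical name. Note also that xpath() returns a list, so the title from the question would probably need title[0] before being written.

```python
import io
import os
import tempfile
from multiprocessing.dummy import Pool as ThreadPool


def fetch_chapter(url):
    """Placeholder for spider(): returns a dict instead of writing to a shared file."""
    # Assumption: real code would download and parse `url` here.
    return {"title": "标题 " + url, "content": "正文 " + url}


def save_chapters(urls, path):
    """Fetch chapters in parallel, then write them in order from the main thread."""
    pool = ThreadPool(4)
    try:
        chapters = pool.map(fetch_chapter, urls)  # map() preserves input order
    finally:
        pool.close()
        pool.join()
    # An explicit encoding avoids relying on the interpreter's default.
    with io.open(path, "a", encoding="utf-8") as f:
        for chapter in chapters:
            f.write("章：" + chapter["title"] + "\n")
            f.write(chapter["content"] + "\n\n")


out_path = os.path.join(tempfile.mkdtemp(), "guduliujiang.txt")
save_chapters(["page-1", "page-2"], out_path)
with io.open(out_path, encoding="utf-8") as f:
    print(f.read())
```

Writing from a single thread also keeps the chapters in page order, which concurrent writes to one shared file handle do not guarantee.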
