Detecting and changing website encoding in Python

Posted on 2024-10-28 15:22:21, 789 characters, 0 views, 0 comments

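I have a problem with website encoding. I wrote a program to scrape a website, but I have not managed to change the encoding of the content it reads. My code is: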

import sys,os,glob,re,datetime,optparse
import urllib2

from BSXPath import BSXPathEvaluator,XPathResult
#import BeautifulSoup

#from utility import *

sTargetEncoding = "utf-8"

page_to_process = "http://www.xxxx.com" 
req = urllib2.urlopen(page_to_process)
content = req.read()
encoding=req.headers['content-type'].split('charset=')[-1]
print encoding

ucontent = unicode(content, encoding).encode(sTargetEncoding)
#ucontent = content.decode(encoding).encode(sTargetEncoding)
#ucontent = content

document = BSXPathEvaluator(ucontent)

print "ORIGINAL ENCODING: " + document.originalEncoding

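I am using an external library (BSXPath, an extension of BeautifulSoup), and document.originalEncoding prints the website's encoding rather than the utf-8 encoding I tried to convert to. Does anyone have a suggestion?

Thanks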

破晓 2024-11-04 15:22:21

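Well, there is no guarantee that the encoding announced in the HTTP header matches the encoding declared inside the HTML itself. A mismatch can come from a misconfigured server or from a wrong charset declaration in the HTML. There is no truly reliable automatic way to detect the encoding, let alone the correct one. I suggest checking by hand which encoding the HTML actually uses (iso-8859-1 versus utf-8, for example, is easy to tell apart) and then hard-coding that encoding in your application.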
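As a rough illustration of that advice, here is a minimal sketch in Python 3 (the question's code is Python 2). It works on raw bytes and a Content-Type string, so it needs no network access. The function names and the forced_encoding parameter are hypothetical helpers written for this example. They are not part of urllib2, BeautifulSoup, or BSXPath. The strict UTF-8 check is only a heuristic, and the iso-8859-1 fallback is an assumption.

```python
import re


def charset_from_header(content_type):
    """Return the charset named in a Content-Type header value, or None."""
    match = re.search(r'charset\s*=\s*["\']?([\w.:-]+)', content_type or "", re.I)
    return match.group(1).lower() if match else None


def charset_from_meta(raw_html):
    """Return the charset declared in a <meta> tag near the top of the page, or None."""
    # Only ASCII matters for finding the declaration, so other bytes are dropped.
    head = raw_html[:1024].decode("ascii", errors="ignore")
    match = re.search(r'<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)', head, re.I)
    return match.group(1).lower() if match else None


def looks_like_utf8(raw_html):
    """Heuristic: True if the bytes decode as strict UTF-8."""
    try:
        raw_html.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def decode_page(raw_html, content_type, forced_encoding=None):
    """Decode raw_html to str and report what the header and the HTML claimed.

    forced_encoding is the hard-coded value suggested in the answer. When it
    is not given, the function falls back to the heuristics above.
    """
    header_cs = charset_from_header(content_type)
    meta_cs = charset_from_meta(raw_html)
    if forced_encoding:
        chosen = forced_encoding
    elif looks_like_utf8(raw_html):
        chosen = "utf-8"
    else:
        # Assumption: trust the HTML before the header, then iso-8859-1.
        chosen = meta_cs or header_cs or "iso-8859-1"
    return raw_html.decode(chosen, errors="replace"), header_cs, meta_cs


# A page whose server claims utf-8 while the bytes are really iso-8859-1.
raw = '<html><head><meta charset="iso-8859-1"></head><body>café</body></html>'.encode("iso-8859-1")
text, header_cs, meta_cs = decode_page(raw, "text/html; charset=utf-8")
print("header says:", header_cs)
print("HTML says:", meta_cs)
print("decoded body contains 'café':", "café" in text)
```

Once the right encoding for a given site is known, passing it as forced_encoding skips the guessing entirely. Note also that charset_from_header returns None when the header carries no charset, whereas split('charset=')[-1] in the question would return the whole header value in that case.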
