Detecting and changing website encoding in Python
I have a problem with website encoding. I wrote a program to scrape a website, but I have not managed to change the encoding of the content it reads. My code is:
import sys,os,glob,re,datetime,optparse
import urllib2
from BSXPath import BSXPathEvaluator,XPathResult
#import BeautifulSoup
#from utility import *
sTargetEncoding = "utf-8"
page_to_process = "http://www.xxxx.com"
req = urllib2.urlopen(page_to_process)
content = req.read()
encoding=req.headers['content-type'].split('charset=')[-1]
print encoding
ucontent = unicode(content, encoding).encode(sTargetEncoding)
#ucontent = content.decode(encoding).encode(sTargetEncoding)
#ucontent = content
document = BSXPathEvaluator(ucontent)
print "ORIGINAL ENCODING: " + document.originalEncoding
I use an external library (BSXPath, an extension of BeautifulSoup), and document.originalEncoding prints the website's encoding rather than the utf-8 encoding I tried to convert to.
Does anyone have a suggestion?
Thanks
Comments (1)
Well, there is no guarantee that the encoding given in the HTTP headers is the same as the one specified inside the HTML itself. The two can disagree either because the server is misconfigured or because the charset definition inside the HTML is simply wrong. There is no fully reliable automatic way to detect the encoding, or to decide which of the declared encodings is the right one. I suggest checking the HTML manually for the right encoding (e.g. iso-8859-1 vs. utf-8 can be easily told apart) and then hardcoding that encoding in your app.
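As a rough illustration of the check described above, here is a minimal sketch in Python 3 (the question's code is Python 2, so this is not a drop-in replacement). It reads the charset from the Content-Type header and from a `<meta>` tag, tests whether the bytes are valid UTF-8, and falls back to iso-8859-1. The helper names `charset_from_content_type`, `charset_from_meta`, `looks_like_utf8` and `to_utf8` are hypothetical, and the priority order (valid UTF-8 first, then the meta tag, then the header) is an assumption made for this example, not something the answer prescribes.

```python
import re


def charset_from_content_type(content_type):
    """Return the charset parameter of a Content-Type value, or None if absent."""
    match = re.search(r'charset\s*=\s*["\']?([\w.:-]+)', content_type or "", re.I)
    return match.group(1).lower() if match else None


def charset_from_meta(raw_html):
    """Return the charset declared in a <meta> tag near the top of the page, or None."""
    head = raw_html[:2048]  # assumption: the declaration sits in the first 2 KB
    match = re.search(br'<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)', head, re.I)
    return match.group(1).decode("ascii").lower() if match else None


def looks_like_utf8(raw):
    """True if the bytes decode cleanly as UTF-8 (rare by accident for iso-8859-1 text)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def to_utf8(raw_html, content_type=None, fallback="iso-8859-1"):
    """Re-encode a page as UTF-8; returns (utf8_bytes, source_encoding_used)."""
    if looks_like_utf8(raw_html):
        source = "utf-8"
    else:
        source = (charset_from_meta(raw_html)
                  or charset_from_content_type(content_type)
                  or fallback)
    try:
        text = raw_html.decode(source)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong declared charset: use the fallback instead.
        # The default iso-8859-1 maps every byte, so it cannot fail.
        source = fallback
        text = raw_html.decode(source)
    return text.encode("utf-8"), source


if __name__ == "__main__":
    # The header claims utf-8, but the body is really iso-8859-1.
    page = b'<html><head><meta charset="iso-8859-1"></head><body>caf\xe9</body></html>'
    utf8_page, used = to_utf8(page, "text/html; charset=utf-8")
    print(used)
    print(utf8_page.decode("utf-8"))
```

Note that `charset_from_content_type("text/html")` returns None, whereas `split('charset=')[-1]` in the question would return the whole string `text/html` when the header carries no charset. Also, re-encoding the bytes does not change the `<meta>` declaration inside the page, so a parser that sniffs the encoding itself may still report the declared one; handing the parser already-decoded text is one possible way around that, though whether BSXPath behaves this way is an assumption to verify.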