在 python 中检测和更改网站编码

发布于 2024-10-28 15:22:21 字数 789 浏览 0 评论 0原文

我的网站编码有问题。我编写了一个程序来抓取网站,但我没有成功地更改读取内容的编码。我的代码是:

import sys,os,glob,re,datetime,optparse
import urllib2

from BSXPath import BSXPathEvaluator,XPathResult
#import BeautifulSoup

#from utility import *

sTargetEncoding = "utf-8"

page_to_process = "http://www.xxxx.com" 
req = urllib2.urlopen(page_to_process)
content = req.read()
encoding=req.headers['content-type'].split('charset=')[-1]
print encoding

ucontent = unicode(content, encoding).encode(sTargetEncoding)
#ucontent = content.decode(encoding).encode(sTargetEncoding)
#ucontent = content

document = BSXPathEvaluator(ucontent)

print "ORIGINAL ENCODING: " + document.originalEncoding

我使用外部库(BSXPath是BeautifulSoap的扩展),并且document.originalEncoding打印网站的编码,而不是我尝试更改的utf-8编码。 有人有什么建议吗?

谢谢

I have a problem with website encoding. I maked a program to scrape a website but i didn't have successfully with changing encoding of readed content. My code is:

import sys,os,glob,re,datetime,optparse
import urllib2

from BSXPath import BSXPathEvaluator,XPathResult
#import BeautifulSoup

#from utility import *

sTargetEncoding = "utf-8"

page_to_process = "http://www.xxxx.com" 
req = urllib2.urlopen(page_to_process)
content = req.read()
encoding=req.headers['content-type'].split('charset=')[-1]
print encoding

ucontent = unicode(content, encoding).encode(sTargetEncoding)
#ucontent = content.decode(encoding).encode(sTargetEncoding)
#ucontent = content

document = BSXPathEvaluator(ucontent)

print "ORIGINAL ENCODING: " + document.originalEncoding

I used external library (BSXPath an extension of BeautifulSoap) and the document.originalEncoding print the encoding of website and not the utf-8 encoding that I tried to change.
Have anyone some suggestion?

Thanks

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

破晓 2024-11-04 15:22:21

嗯,不能保证 HTTP 标头呈现的编码与 HTML 本身内部指定的编码相同。发生这种情况的原因可能是服务器端配置错误,也可能是 HTML 中的字符集定义错误。实际上没有自动方法来检测编码或检测正确编码。我建议手动检查 HTML 的编码是否正确(例如可以轻松检测到 iso-8859-1 与 utf-8),然后在应用程序中以某种方式手动对编码进行硬编码。

Well, there is no guarantee that the encoding presented by the HTTP headers is the same the some specified inside the HTML itself. This can happen either due to misconfiguration on the server side or the charset definition inside the HTML can be just wrong. There is really no automatic way to detect the encoding or to detect the right encoding. I suggest to check HTML manually for the right encoding (e.g. iso-8859-1 vs. utf-8 can be easily detected) and then hardcode the encoding somehow manually inside your app.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文