How to download any(!) web page with the correct charset in Python?

Problem

When screen-scraping a webpage using Python one has to know the character encoding of the page. If you get the character encoding wrong, then your output will be messed up.

People usually use some rudimentary technique to detect the encoding: they either use the charset from the header, the charset defined in the meta tag, or an encoding detector (which does not care about meta tags or headers).
By using only one of these techniques, you will sometimes not get the same result as you would in a browser.

Browsers do it this way:

  • A meta tag (or the XML declaration) always takes precedence
  • The encoding from the HTTP header is used when no charset is defined in a meta tag
  • If no encoding is defined at all, then it is time for encoding detection.

(Well... at least that is the way I believe most browsers do it. Documentation is really scarce.)

What I'm looking for is a library that can decide the character set of a page the way a browser would. I'm sure I'm not the first who needs a proper solution to this problem.

Solution (I have not tried it yet...)

According to Beautiful Soup's documentation:

Beautiful Soup tries the following encodings, in order of priority, to turn your document into Unicode:

  • An encoding you pass in as the fromEncoding argument to the soup constructor.
  • An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
  • An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
  • An encoding sniffed by the chardet library, if you have it installed.
  • UTF-8
  • Windows-1252
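
A minimal sketch of exercising that list, assuming the modern bs4 package (where the quoted fromEncoding argument is spelled from_encoding and the detection result is exposed as original_encoding; page.html stands in for whatever bytes you downloaded):

from bs4 import BeautifulSoup  # Beautiful Soup 4; the documentation quoted above describes version 3

raw_bytes = open('page.html', 'rb').read()

# Let Beautiful Soup walk its priority list (declared encoding, byte sniffing,
# chardet if installed, then the UTF-8 / Windows-1252 fallbacks).
soup = BeautifulSoup(raw_bytes, 'html.parser')
print(soup.original_encoding)

# Or short-circuit the list by passing an encoding explicitly
# (fromEncoding in Beautiful Soup 3, from_encoding in Beautiful Soup 4).
soup = BeautifulSoup(raw_bytes, 'html.parser', from_encoding='iso-8859-1')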

是伱的 2024-08-13 02:48:50

When you download a file with urllib or urllib2, you can find out whether a charset header was transmitted:

import urllib2

fp = urllib2.urlopen(request)
charset = fp.headers.getparam('charset')  # None when no charset was transmitted

You can use BeautifulSoup to locate a meta element in the HTML:

import BeautifulSoup

soup = BeautifulSoup.BeautifulSoup(data)
# The guard against None skips meta tags that have no http-equiv attribute.
meta = soup.findAll('meta', {'http-equiv': lambda v: v and v.lower() == 'content-type'})
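
A possible follow-up step (not part of the original answer): pull the charset value out of the matched element's content attribute, which typically looks like text/html; charset=utf-8:

import re

if meta:
    content = meta[0].get('content', '')
    match = re.search(r'charset=([\w-]+)', content, re.I)
    if match:
        charset = match.group(1)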

If neither is available, browsers typically fall back to user configuration, combined with auto-detection. As rajax proposes, you could use the chardet module. If you have user configuration available telling you that the page should be Chinese (say), you may be able to do better.

扛刀软妹 2024-08-13 02:48:50

Use the Universal Encoding Detector:

>>> import chardet
>>> import urllib2
>>> chardet.detect(urllib2.urlopen("http://google.cn/").read())
{'encoding': 'GB2312', 'confidence': 0.99}

The other option would be to just use wget:

  import os
  h = os.popen('wget -q -O foo1.txt http://foo.html')
  h.close()
  s = open('foo1.txt').read()

溺ぐ爱和你が 2024-08-13 02:48:50

It seems like you need a hybrid of the answers presented:

  1. Fetch the page using urllib
  2. Find <meta> tags using Beautiful Soup or another method
  3. If no meta tags exist, check the headers returned by urllib
  4. If that still doesn't give you an answer, use the universal encoding detector (a rough sketch of this hybrid follows below).

I honestly don't believe you're going to find anything better than that.

In fact, if you read further into the FAQ linked in the comments on the other answer, that's what the author of the detector library advocates.

If you believe the FAQ, this is what browsers do (as requested in your original question), since the detector is a port of the Firefox sniffing code.
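
A rough sketch of that hybrid, assuming Python 3's urllib.request and the chardet package; a regex stands in for Beautiful Soup in step 2:

import re
import urllib.request

import chardet

def fetch_with_charset(url):
    # Hypothetical helper following the order above:
    # meta tag first, then the HTTP header, then chardet as a last resort.
    with urllib.request.urlopen(url) as response:
        raw = response.read()
        header_charset = response.headers.get_content_charset()  # None if absent

    meta = re.search(rb'<meta[^>]+charset=["\']?([\w-]+)', raw, re.IGNORECASE)
    if meta:
        encoding = meta.group(1).decode('ascii')
    elif header_charset:
        encoding = header_charset
    else:
        encoding = chardet.detect(raw).get('encoding') or 'utf-8'

    return raw.decode(encoding, errors='replace'), encoding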

随风而去 2024-08-13 02:48:50

I would use html5lib for this.
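
For instance, a minimal sketch (assuming html5lib is installed; fed raw bytes, it applies the HTML5 encoding-sniffing rules itself, and documentEncoding is assumed here to expose the result):

import html5lib
from urllib.request import urlopen

raw = urlopen('http://example.com/').read()

parser = html5lib.HTMLParser()
document = parser.parse(raw)    # bytes in, so the encoding sniffing happens here
print(parser.documentEncoding)  # assumed property holding the detected encoding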

疯了 2024-08-13 02:48:50

Scrapy downloads a page and detects the correct encoding for it, unlike requests.get(url).text or urlopen. To do so it tries to follow browser-like rules - this is the best one can do, because website owners have an incentive to make their websites work in a browser. Scrapy needs to take HTTP headers, <meta> tags, BOM marks and differences in encoding names into account.

Content-based guessing (chardet, UnicodeDammit) on its own is not a correct solution, as it may fail; it should be only used as a last resort when headers or <meta> or BOM marks are not available or provide no information.

You don't have to use Scrapy to get its encoding detection functions; they are released (along with some other stuff) in a separate library called w3lib: https://github.com/scrapy/w3lib.

To get the page encoding and the Unicode body, use the w3lib.encoding.html_to_unicode function with a content-based guessing fallback:

import chardet
from w3lib.encoding import html_to_unicode

def _guess_encoding(data):
    return chardet.detect(data).get('encoding')

detected_encoding, html_content_unicode = html_to_unicode(
    content_type_header,
    html_content_bytes,
    default_encoding='utf8', 
    auto_detect_fun=_guess_encoding,
)
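
For illustration, the two inputs might come straight from an HTTP response; a hypothetical example using the requests library:

import requests

response = requests.get('http://example.com/')
content_type_header = response.headers.get('Content-Type')  # e.g. 'text/html; charset=utf-8'
html_content_bytes = response.content                        # raw, undecoded body bytes
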
桃扇骨 2024-08-13 02:48:50

Instead of trying to get a page and then figuring out the charset the browser would use, why not just use a browser to fetch the page and check what charset it uses?

from win32com.client import DispatchWithEvents
import pythoncom  # needed for PumpWaitingMessages below
import threading


stopEvent=threading.Event()

class EventHandler(object):
    def OnDownloadBegin(self):
        pass

def waitUntilReady(ie):
    """
    copypasted from
    http://mail.python.org/pipermail/python-win32/2004-June/002040.html
    """
    if ie.ReadyState!=4:
        while 1:
            print "waiting"
            pythoncom.PumpWaitingMessages()
            stopEvent.wait(.2)
            if stopEvent.isSet() or ie.ReadyState==4:
                stopEvent.clear()
                break;

ie = DispatchWithEvents("InternetExplorer.Application", EventHandler)
ie.Visible = 0
ie.Navigate('http://kskky.info')
waitUntilReady(ie)
d = ie.Document
print d.CharSet

喜爱纠缠 2024-08-13 02:48:50

BeautifulSoup does this with UnicodeDammit: Unicode, Dammit
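
A minimal sketch, assuming the modern bs4 package (Beautiful Soup 3 ships the same class in its own module under slightly different attribute names):

from bs4 import UnicodeDammit

raw_bytes = open('page.html', 'rb').read()
dammit = UnicodeDammit(raw_bytes)

print(dammit.original_encoding)  # the encoding UnicodeDammit settled on
text = dammit.unicode_markup     # the document decoded to Unicode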
