What is the default content-type/charset?

Posted on 2024-08-12 20:57:43

According to this answer: urllib2 read to Unicode

I have to get the content-type in order to convert the page to Unicode. However, some websites don't have a "charset".

For example, the ['content-type'] for this page is "text/html". I can't convert it to Unicode.

encoding=urlResponse.headers['content-type'].split('charset=')[-1]
htmlSource = unicode(htmlSource, encoding)
TypeError: 'int' object is not callable

Is there a default "encoding" (English, of course)...so that if nothing is found, I can just use that?

Comments (4)

飘过的浮云 2024-08-19 20:57:43

Is there a default "encoding" (English, of course)...so that if nothing is found, I can just use that?

No, there isn't. You must guess.

Trivial approach: try to decode as UTF-8. If it works, great, it's probably UTF-8. If it doesn't, choose the most likely encoding for the kinds of pages you're browsing. For English pages that's cp1252, the Windows Western European encoding. (It is like ISO-8859-1; in fact most browsers will use cp1252 instead of iso-8859-1 even if you specify that charset, so it's worth duplicating that behaviour.)
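A minimal sketch of the trivial approach described above, written for Python 3 (where bytes.decode returns str) rather than the Python 2 used in the question. The function name decode_html and the fallback parameter are made up for illustration.

```python
def decode_html(raw: bytes, fallback: str = "cp1252") -> str:
    """Try UTF-8 first; if the bytes are not valid UTF-8, use the fallback."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # cp1252 leaves a handful of byte values undefined, so replace
        # anything it cannot map instead of raising a second time.
        return raw.decode(fallback, errors="replace")


print(decode_html("café".encode("utf-8")))  # valid UTF-8
print(decode_html(b"caf\xe9"))              # not valid UTF-8, decoded as cp1252
```

Trying UTF-8 first is reasonable because text in a legacy 8-bit encoding rarely happens to be valid UTF-8, while the reverse check (decoding UTF-8 bytes as cp1252) almost always "succeeds" and silently produces mojibake.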

If you need to guess other languages, it gets very hairy. There are existing modules to help you guess in these situations. See e.g. chardet.

悟红尘 2024-08-19 20:57:43

Well, I just browsed the given URL, which redirects to

http://www.engadget.com/2009/11/23/apple-hits-back-at-verizon-in-new-iphone-ads-video

then hit Ctrl + U (view source) in Firefox and it shows

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

@Konrad: what do you mean "seems as though ... uses ISO-8859-1"??

@alex: what makes you think it doesn't have a "charset"??

Look at the code you have (we guess this is the line that causes the error; please always show the full traceback and error message!):

htmlSource = unicode(htmlSource, encoding)

and the error message:

TypeError: 'int' object is not callable

That means that unicode doesn't refer to the built-in function, it refers to an int. I recall that in your other question you had something like

if unicode == 1:

I suggest that you use some other name for that variable -- say use_unicode.
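The shadowing problem can be reproduced in a few lines. This is a sketch in Python 3, where the built-in str plays the role that unicode plays in Python 2; the function names broken and fixed are hypothetical.

```python
def broken(raw):
    str = 1                       # a flag that reuses the built-in's name
    if str == 1:
        return str(raw, "utf-8")  # str is now the int 1, so this call fails
    return raw


def fixed(raw):
    use_unicode = 1               # a distinct name leaves the built-in alone
    if use_unicode == 1:
        return str(raw, "utf-8")
    return raw


try:
    broken(b"abc")
except TypeError as exc:
    print(exc)                    # 'int' object is not callable

print(fixed(b"abc"))              # abc
```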

More suggestions: (1) always show enough code to reproduce the error (2) always read the error message.

爱,才寂寞 2024-08-19 20:57:43

htmlSource=htmlSource.decode("utf8") should work for most cases, unless you are crawling sites that use non-English encodings.

Or you could write the force decode function like this:

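def forcedecode(text):
    # Try each candidate encoding in order; return the first clean decode.
    for x in ["utf8", "sjis", "cp1252", "utf16"]:
        try:
            return text.decode(x)
        except UnicodeDecodeError:
            pass
    return "Unknown Encoding"  # placeholder string, not the decoded text

吾家有女初长成 2024-08-19 20:57:43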

If there's no explicit charset in the content type, it should be ISO-8859-1 (the HTTP/1.1 default for text/* media types), as stated earlier in the answers. Unfortunately that's not always the case, which is why browser developers spent some time on getting algorithms going that try to guess the encoding based on the content of your page.
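A minimal sketch, in Python 3, of reading the charset parameter from a Content-Type header value and falling back to ISO-8859-1 when it is absent. The helper name charset_from_content_type is hypothetical; it relies on the standard library's email.message.Message to parse the header parameters.

```python
from email.message import Message


def charset_from_content_type(content_type, default="iso-8859-1"):
    """Return the lowercased charset parameter, or the default if none is given."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(default)


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))

# For comparison, the split used in the question returns the whole
# header value when no charset is present:
print("text/html".split("charset=")[-1])
```

Parsing the parameter explicitly avoids passing "text/html" to the decoder as if it were an encoding name.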

Luckily for you, Mark Pilgrim did all the hard work of porting the Firefox implementation to Python, in the form of the chardet module. His introduction to how it works, written as one of the chapters of Dive Into Python 3, is also well worth reading.
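A rough sketch of how chardet might be used, in Python 3. chardet is a third-party package, so the import is treated as optional here; the wrapper name guess_encoding and its cp1252 fallback are assumptions made for illustration, not part of chardet.

```python
try:
    import chardet  # third-party; install separately if wanted
except ImportError:
    chardet = None


def guess_encoding(raw: bytes, fallback: str = "cp1252") -> str:
    """Ask chardet for its best guess; use the fallback if it has none."""
    if chardet is not None:
        guess = chardet.detect(raw)  # dict with "encoding" and "confidence"
        if guess.get("encoding"):
            return guess["encoding"]
    return fallback


raw = b"hello world"
print(raw.decode(guess_encoding(raw), errors="replace"))
```

The result is a statistical guess, so it is worth checking the "confidence" value chardet reports before trusting it on short inputs.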
