What is the default content-type/charset?

Posted 2024-08-12 20:57:43

According to this answer: urllib2 read to Unicode

I have to get the content-type in order to decode the page to Unicode. However, some websites don't include a "charset" in it.

For example, the ['content-type'] for this page is just "text/html", so I can't convert it to Unicode.

encoding = urlResponse.headers['content-type'].split('charset=')[-1]
htmlSource = unicode(htmlSource, encoding)
TypeError: 'int' object is not callable

Is there a default "encoding" (English, of course), so that if nothing is found, I can just use that?


飘过的浮云 2024-08-19 20:57:43

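Is there a default "encoding" (English, of course), so that if nothing is found, I can just use that?

No, there isn't. You have to guess.

The simple approach: try to decode as UTF-8. If that works, great, it's probably UTF-8. If it doesn't, pick the most likely encoding for the kind of pages you are browsing. For English pages that is cp1252, the Windows Western European encoding. (It is like ISO-8859-1; in fact, most browsers will use cp1252 instead of iso-8859-1 even if you specify that charset, so it is worth duplicating that behaviour.)

If you need to guess for other languages, it gets very hairy. There are existing modules that help you guess in these situations; see e.g. chardet.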
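The guessing order described above could be sketched roughly as follows. This is only an illustration written for Python 3 (the question uses Python 2's `unicode`), and the helper names `declared_charset` and `decode_html` are made up for this example. It assumes the raw page bytes and the Content-Type header value are already in hand.

```python
from email.message import Message


def declared_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset()


def decode_html(raw, content_type="text/html"):
    """Decode page bytes: declared charset first, then UTF-8, then cp1252."""
    candidates = []
    charset = declared_charset(content_type)
    if charset:
        candidates.append(charset)
    candidates.append("utf-8")
    for name in candidates:
        try:
            return raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess, or a charset name Python does not know: try the next.
            continue
    # Last resort for English pages (an assumption, as in the answer above).
    # A few bytes are undefined in cp1252, so replace them instead of failing.
    return raw.decode("cp1252", errors="replace")
```

With a bare "text/html" header, `decode_html(b"caf\xe9")` fails the UTF-8 attempt and falls through to cp1252. The sketch does not try to mimic the browser habit of treating a declared iso-8859-1 as cp1252; it uses the declared charset as given.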


悟红尘 2024-08-19 20:57:43

Well, I just browsed the given URL, which redirects to

http://www.engadget.com/2009/11/23/apple-hits-back-at-verizon-in-new-iphone-ads-video

then hit Ctrl + U (view source) in Firefox, and it shows

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

@Konrad: what do you mean by "seems as though ... uses ISO-8859-1"?

@alex: what makes you think it doesn't have a "charset"?

Look at the code you have (which we guess is the line that causes the error; please always show the full traceback and error message!):

htmlSource = unicode(htmlSource, encoding)

and the error message:

TypeError: 'int' object is not callable

That means that unicode doesn't refer to the built-in function; it refers to an int. I recall that in your other question you had something like

if unicode == 1:

I suggest that you use some other name for that variable, say use_unicode.

More suggestions: (1) always show enough code to reproduce the error (2) always read the error message.
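The shadowing problem can be reproduced in a few lines. The sketch below is an illustration, not the asker's actual code. Python 3 has no `unicode` built-in, so it shadows `str` instead, and both function names are hypothetical.

```python
def decode_with_shadowed_name(raw):
    """Reproduce the error: a local int hides the built-in str type."""
    str = 1  # Stands in for the asker's `unicode = 1` flag (assumed).
    try:
        return str(raw, "utf-8")  # Calls the int, not the built-in.
    except TypeError as exc:
        return "TypeError: %s" % exc


def decode_with_renamed_flag(raw):
    """Same logic with the flag renamed, so the built-in stays reachable."""
    use_unicode = 1
    if use_unicode == 1:
        return str(raw, "utf-8")
    return raw
```

Calling `decode_with_shadowed_name(b"abc")` returns the text of the TypeError, while `decode_with_renamed_flag(b"abc")` decodes normally.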


爱，才寂寞 2024-08-19 20:57:43

htmlSource = htmlSource.decode("utf8") should work in most cases, unless you are crawling sites that use non-English encodings.

Or you could write a force-decode function like this:

def forcedecode(text):
    # Try each candidate encoding in order; return the first clean decode.
    for x in ["utf8", "sjis", "cp1252", "utf16"]:
        try:
            return text.decode(x)
        except UnicodeDecodeError:
            pass
    # None of the candidates worked; callers must check for this value.
    return "Unknown Encoding"
