Getting the content type from an HTML page with BeautifulSoup



I am trying to get the character encoding for pages that I scrape, but in some cases it is failing. Here is what I am doing:

resp = urllib2.urlopen(request)
self.COOKIE_JAR.extract_cookies(resp, request)
content = resp.read()
encodeType = resp.headers.getparam('charset')
resp.close()

That is my first attempt. But if the charset comes back as None, I do this:

soup = BeautifulSoup(html)
if encodeType == None:
    try:
        encodeType = soup.findAll('meta', {'http-equiv':lambda v:v.lower()=='content-type'})
    except AttributeError, e:
        print e
        try:
            encodeType = soup.findAll('meta', {'charset':lambda v:v.lower() != None})
        except AttributeError, e:
            print e
            if encodeType == '':
                encodeType = 'iso-8859-1'

The page I am testing has this in the header:
<meta charset="ISO-8859-1">

I would expect the first try statement to return an empty string, but I get this error on both try statements (which is why the 2nd statement is nested for now):

'NoneType' object has no attribute 'lower'

What is wrong with the 2nd try statement? I am guessing the 1st one is incorrect as well, since it's throwing an error and not just coming back blank.

OR better yet, is there a more elegant way to just remove any special character encoding from a page? The end result I am trying to accomplish is that I don't care about any of the specially encoded characters. I want to delete encoded characters and keep the raw text. Can I skip all of the above and tell BeautifulSoup to just strip anything that is encoded?
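The NoneType error in both try statements comes from how attribute filters work: BeautifulSoup calls the lambda for every meta tag it examines and passes None as the value when a tag lacks that attribute, so v.lower() fails on any tag missing http-equiv (or charset). A guarded filter avoids this; a minimal sketch against the BeautifulSoup 3 findAll API (illustrative, not the asker's original code):

# The filter must tolerate None: BeautifulSoup passes None as the value
# whenever a <meta> tag does not carry the attribute being matched.
content_type_metas = soup.findAll('meta', {'http-equiv': lambda v: v and v.lower() == 'content-type'})
charset_metas = soup.findAll('meta', {'charset': lambda v: v is not None})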


2 Answers

樱花细雨 2024-11-02 16:29:15


I decided to just go with whatever BeautifulSoup spits out. Then as I parse through each word in the document, if I can't convert it to a string, I just disregard it.

for word in doc.lower().split():
    try:
        word = str(word)
        word = self.handlePunctuation(word)
        if word == False:
            continue
    except UnicodeEncodeError:
        # word couldn't be converted to a string; most likely encoding garbage we can toss anyway
        continue
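If the goal is to keep the readable part of a word rather than discard the whole thing, an alternative (not part of the original answer) is the codec 'ignore' error handler, which silently drops characters that can't be encoded:

for word in doc.lower().split():
    # 'ignore' drops any characters that can't be encoded as ASCII,
    # keeping whatever ASCII portion of the word remains.
    cleaned = word.encode('ascii', 'ignore')
    if not cleaned:
        continue  # the entire word was non-ASCII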
帅哥哥的热头脑 2024-11-02 16:29:15


When attempting to determine the character encoding of a page, I believe the order that should be tried is:

  1. Determine from the HTML page itself via meta tags (e.g. <meta http-equiv="Content-Type" content="text/html; charset=utf-8">)
  2. Determine encoding via the HTTP headers (e.g. Content-Type: text/html; charset=ISO-8859-1)
  3. Finally, if the above don't yield anything, you can use an algorithm that guesses the character encoding from the distribution of bytes within the page (note that this isn't guaranteed to find the right encoding). Check out the chardet library for this option; a sketch of the full fallback chain follows this list.
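A minimal sketch of that fallback chain, assuming Python 2 with BeautifulSoup 3 and the chardet library installed (the detect_encoding function and the URL handling are illustrative, not from the answer):

import urllib2
import chardet
from BeautifulSoup import BeautifulSoup

def detect_encoding(url):
    resp = urllib2.urlopen(url)
    content = resp.read()
    header_charset = resp.headers.getparam('charset')  # from the Content-Type header
    resp.close()

    # 1. Check the page's own meta tags first. The lambdas tolerate None,
    #    which BeautifulSoup passes when a tag lacks the attribute.
    soup = BeautifulSoup(content)
    meta = soup.find('meta', {'http-equiv': lambda v: v and v.lower() == 'content-type'})
    if meta and 'charset=' in meta.get('content', '').lower():
        return meta['content'].lower().split('charset=')[-1].strip()
    meta = soup.find('meta', {'charset': lambda v: v is not None})  # HTML5-style <meta charset="...">
    if meta:
        return meta['charset']

    # 2. Fall back to the charset parameter of the HTTP Content-Type header.
    if header_charset:
        return header_charset

    # 3. Last resort: guess from the byte distribution; not guaranteed correct.
    guess = chardet.detect(content)  # returns e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73}
    return guess['encoding'] or 'iso-8859-1'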