BeautifulSoup Tag, strings, and UnicodeEncodeError: not so beautiful after all

Posted on 2024-12-29 23:21:43

I have spent several frustrating hours this morning trying to handle strings from scraped web pages. I can't seem to find a consistent way of lowercasing the extracted string so that I can check it for keywords, and it's driving me round the bend.

Here is a snippet of code that retrieves text from a DOM element:

temp = i.find('div', 'foobar').find('div')
if temp is not None and temp.contents is not None:
    temp2 = whitespace.sub(' ', temp.contents[0])
    content = str(temp2)

This raises the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 150: ordinal not in range(128)
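
For context, the likely cause is that in Python 2, calling str() on a unicode object encodes it implicitly with the ASCII codec, and the non-breaking space u'\xa0' has no ASCII representation. The snippet below is only a minimal reproduction that makes that implicit step explicit; the sample string is made up and stands in for the scraped text.

```python
# Hypothetical stand-in for the scraped text: it contains a
# non-breaking space (U+00A0), which ASCII cannot represent.
text = u"price:\xa0100"

try:
    # Python 2's str(unicode_obj) performs this encode step implicitly.
    text.encode("ascii")
except UnicodeEncodeError as exc:
    message = str(exc)
    print(message)
```

Because the failure happens during this encode step, any later call chained after str() never gets a chance to run.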

I also tried the following statements - NONE of which worked; i.e. they resulted in the same error being thrown:

content = (str(temp2)).decode('utf-8').lower()
content = str(temp2.decode('utf-8')).lower()

Does anyone know how to convert the text contained within a BeautifulSoup Tag into lowercase ASCII, so that I can run a case-insensitive search for keywords?
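
One possible approach, sketched under assumptions: keep the text as unicode, normalize it, drop anything that still has no ASCII equivalent, and only then lowercase it. The helper name `to_lower_ascii`, the `whitespace` pattern (the original definition is not shown in the question), and the sample string are all hypothetical; in the real code the input would be `temp.contents[0]`.

```python
import re
import unicodedata

# Assumed stand-in for the question's `whitespace` pattern:
# collapses any run of whitespace into a single space.
whitespace = re.compile(r"\s+")


def to_lower_ascii(text):
    """Return a lowercase, ASCII-only version of a unicode string."""
    # NFKD splits accented letters into base letter + combining mark and
    # maps compatibility characters such as U+00A0 to a plain space.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop whatever still has no ASCII equivalent, then return to text.
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    return whitespace.sub(" ", ascii_text).strip().lower()


# Hypothetical sample standing in for temp.contents[0].
sample = u"Caf\xe9\xa0MENU \n Special"
content = to_lower_ascii(sample)
found = "menu" in content  # case-insensitive keyword check
```

Note that this is lossy: characters with no ASCII counterpart are silently discarded. If the keywords themselves may contain non-ASCII characters, calling `.lower()` on the unicode string directly and skipping the ASCII step may be the better fit.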
