BeautifulSoup Tag, strings, and UnicodeEncodeError: not so beautiful

Posted on 2024-12-29 23:21:43

I have spent several frustrating hours this morning trying to handle strings from scraped web pages. I can't seem to find a consistent way of lowercasing the extracted string so I can check for keywords - and it's driving me round the bend.

Here is a snippet of code that retrieves text from a DOM element:

temp = i.find('div', 'foobar').find('div')
if temp is not None and temp.contents is not None:
    temp2 = whitespace.sub(' ', temp.contents[0])
    content = str(temp2)

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 150: ordinal not in range(128)

I also tried the following statements - NONE of which worked; i.e. they resulted in the same error being thrown:

content = (str(temp2)).decode('utf-8').lower()
content = str(temp2.decode('utf-8')).lower()

Does anyone know how to convert the text contained within a BeautifulSoup Tag into lowercase ASCII, so I may conduct a case-insensitive search for keyword(s)?

Comments (1)

冷清清 2025-01-05 23:21:43

You may want ASCII, but you need Unicode, and there's a good chance that you've got it already. XML parsers return unicode objects.
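
As a rough illustration of where the error comes from: on Python 2, calling str() on a unicode object implicitly encodes it with the ASCII codec, and U+00A0 has no ASCII representation. The sketch below reproduces the same exception with an explicit encode. The sample string is made up for the example, and the code is written so that it also runs on Python 3.

```python
# Hypothetical sample text containing a NO-BREAK SPACE (U+00A0) at index 3.
text = u'Foo\xa0Bar'

try:
    # Explicit equivalent of what str(text) does implicitly on Python 2.
    text.encode('ascii')
    error = None
except UnicodeEncodeError as exc:
    error = exc

# Show the exception type and the position of the offending character.
print(type(error).__name__)
print(error.start)
```

Keeping the text as Unicode and skipping the str() call avoids the encode step, and therefore the error, altogether.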

First, do print type(temp2) ... It should be unicode unless something unfortunate has happened, like maybe that whitespace.sub() thingy; what is that?

If you want to normalise multiple whitespace characters into a single space, do

temp2 = u' '.join(temp.contents[0].split())

That will make that nasty u'\xA0' vanish, because it's whitespace (NO-BREAK SPACE).

Then try content = temp2.lower()
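
Putting the two steps together, here is a minimal sketch of a case-insensitive keyword check. The helper names normalise and contains_keyword and the sample string are assumptions made for illustration, not part of BeautifulSoup; in real use the input would be something like temp.contents[0]. It relies on split() with no arguments treating U+00A0 as whitespace for Unicode strings.

```python
def normalise(text):
    """Collapse runs of whitespace into single spaces and lowercase the result."""
    # With no arguments, split() on a Unicode string splits on any Unicode
    # whitespace, including NO-BREAK SPACE (U+00A0), and drops empty pieces.
    return u' '.join(text.split()).lower()


def contains_keyword(text, keywords):
    """Return True if any keyword occurs in text, ignoring case."""
    haystack = normalise(text)
    return any(keyword.lower() in haystack for keyword in keywords)


# Hypothetical scraped text with mixed whitespace and a NO-BREAK SPACE.
sample = u'  Some\xa0 Scraped \n TEXT  '
content = normalise(sample)

print(repr(content))
print(contains_keyword(sample, [u'Scraped']))
```

Everything stays Unicode from start to finish, so no ASCII conversion is needed for the search itself.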
