BeautifulSoupTag, strings and UnicodeEncodeError: not so beautiful after all
I have spent several frustrating hours this morning trying to handle strings from scraped web pages. I can't seem to find a consistent way of lowercasing the extracted string so I can check for keywords, and it's driving me round the bend.
Here is a snippet of code that retrieves text from a DOM element:
temp = i.find('div', 'foobar').find('div')
if temp is not None and temp.contents is not None:
    temp2 = whitespace.sub(' ', temp.contents[0])
    content = str(temp2)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 150: ordinal not in range(128)
I also tried the following statements - NONE of which worked; i.e. they resulted in the same error being thrown:
content = (str(temp2)).decode('utf-8').lower()
content = str(temp2.decode('utf-8')).lower()
Does anyone know how to convert the text contained within a BeautifulSoupTag into lowercase ASCII, so I can conduct a case-insensitive search for keyword(s)?
Comments (1)
You may want ASCII, but you need Unicode, and there's a good chance that you've got it already. XML parsers return unicode objects. Firstly, do
print type(temp2)
... It should be unicode unless something unfortunate has happened, like maybe that whitespace.sub() thingy; what is that?
If you want to normalise multiple whitespace characters into a single space, do
temp2 = u' '.join(temp.contents[0].split())
That will make that nasty u'\xA0' vanish, because it's a whitespace character (NO-BREAK SPACE).
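To see the split/join normalisation in isolation, here is a minimal sketch. It is written for Python 3, where every str is already Unicode (so the u prefix is optional), rather than the Python 2 of the question. The helper name normalise_whitespace and the sample string are made up for illustration; they are not part of the original code.

```python
def normalise_whitespace(text):
    """Collapse every run of Unicode whitespace into a single space."""
    # str.split() with no arguments splits on any Unicode whitespace,
    # including NO-BREAK SPACE (U+00A0), and drops leading/trailing runs.
    return " ".join(text.split())


# Hypothetical scraped text containing two NO-BREAK SPACE characters.
sample = "Hello\xa0\xa0World \n\t again"

# Encoding the raw text to ASCII fails, much like str() did in the question.
try:
    sample.encode("ascii")
except UnicodeEncodeError as exc:
    print("ascii encode failed:", exc.reason)

cleaned = normalise_whitespace(sample)
print(repr(cleaned))  # 'Hello World again'
```

After normalisation the text contains only plain spaces between words, so the NO-BREAK SPACE that triggered the error is gone.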
Then try
content = temp2.lower()
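Putting the two steps together, a case-insensitive keyword check could look roughly like the sketch below. It assumes Python 3; the function name contains_keyword, the sample text, and the keyword lists are all hypothetical and only serve to illustrate the idea of staying in Unicode throughout.

```python
def contains_keyword(text, keywords):
    """Return True if any keyword occurs in text, ignoring case."""
    # Normalise whitespace first (this also removes NO-BREAK SPACE),
    # then lowercase both the text and each keyword before comparing.
    haystack = " ".join(text.split()).lower()
    return any(keyword.lower() in haystack for keyword in keywords)


# Hypothetical scraped text with a NO-BREAK SPACE between the first two words.
scraped = "Special\xa0OFFER: Free Shipping"

print(contains_keyword(scraped, ["free shipping"]))  # True
print(contains_keyword(scraped, ["discount"]))       # False
```

No conversion to ASCII is needed at any point: the comparison works on Unicode strings directly.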