How do I convert unicode-escaped < and > when reading an HTML document?

Posted on 2024-11-17 01:36:51

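When I read some (but not all) HTML files in Python using a urllib2 opener, the text I get back for some files is full of backslashes and unicode 003c strings. I pass this text to BeautifulSoup but have trouble finding what I'm looking for with findAll(), and I now suspect that all these unicode strings are the cause.

What's going on here, and how do I get rid of it?

Approaches like soup.prettify() have no effect.

Here's some sample output (from a Facebook profile):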

\\u003cdiv class=\\"pas status fcg\\">Loading...\\u003c\\/div>
\\u003c\\/div>\\u003cdiv class=\\"uiTypeaheadView fbChatBuddyListTypeaheadView dark hidden_elem\\" id=\\"u971289_14\\">\\u003c\\/div>
\\u003c\\/div>\\u003c\\/div>\\u003cdiv class=\\"fbNubFlyoutFooter\\">
\\u003cdiv class=\\"uiTypeahead uiClearableTypeahead fbChatTypeahead\\" id=\\"u971289_15\\">
\\u003cdiv class=\\"wrap\\">\\u003clabel class=\\"clear uiCloseButton\\" for=\\"u971291_21\\">

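The same HTML page looks fine and normal in a 'view source' window.

EDIT: Here's the code that produces that text. Strangely, I don't get this kind of output from other HTML pages. Note that I've replaced the username and password with USERNAME and PASSWORD here. You could try this on your own FB profile if you replace those two.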

import cookielib
import os
import urllib
import urllib2

from BeautifulSoup import BeautifulSoup

fbusername = "USERNAME"  # login email
fbpassword = "PASSWORD"
cookiefile = "facebook.cookies"

# Reuse cookies from an earlier run if the cookie file already exists.
cj = cookielib.MozillaCookieJar(cookiefile)
if os.access(cookiefile, os.F_OK):
    cj.load()

# Opener that follows redirects and keeps cookies across requests.
opener = urllib2.build_opener(
    urllib2.HTTPRedirectHandler(),
    urllib2.HTTPHandler(debuglevel=0),
    urllib2.HTTPSHandler(debuglevel=0),
    urllib2.HTTPCookieProcessor(cj)
)

opener.addheaders = [
    ('User-agent', 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_7; en-us) AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1'),
    ('Referer', 'http://www.facebook.com/'),
]

def facebooklogin():
    # POST the credentials to the login form and return the response body.
    logindata = urllib.urlencode({
        'email': fbusername,
        'pass': fbpassword,
    })

    response = opener.open("https://login.facebook.com/login.php", logindata)
    return ''.join(response.readlines())


print "Logging in to Facebook...\n"
# Log in twice (presumably so the second request carries the cookies set by the first).
facebooklogin()
facebooklogin()
print "Successful.\n"

fetchURL = 'http://www.facebook.com/USERNAME?ref=profile&v=info'

# Fetch the profile page and hand the raw text to BeautifulSoup.
f = opener.open(fetchURL)
fba = f.read()
f.close()
soup = BeautifulSoup(fba)
print soup



猥︴琐丶欲为 2024-11-24 01:36:51

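The u""" construct is for Python 2. In Python 3 you omit the u; note that str has no decode() method there, so the session below applies to Python 2 as written.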

>>> a=u"""\\u003cdiv class=\\"pas status fcg\\">Loading...\\u003c\\/div>
... \\u003c\\/div>\\u003cdiv class=\\"uiTypeaheadView fbChatBuddyListTypeaheadView dark hidden_elem\\" id=\\"u971289_14\\">\\u003c\\/div>
... \\u003c\\/div>\\u003c\\/div>\\u003cdiv class=\\"fbNubFlyoutFooter\\">
... \\u003cdiv class=\\"uiTypeahead uiClearableTypeahead fbChatTypeahead\\" id=\\"u971289_15\\">
... \\u003cdiv class=\\"wrap\\">\\u003clabel class=\\"clear uiCloseButton\\" for=\\"u971291_21\\">
... """
>>> print(a.decode('unicode_escape').replace('\\/', '/'))
<div class="pas status fcg">Loading...</div>
</div><div class="uiTypeaheadView fbChatBuddyListTypeaheadView dark hidden_elem" id="u971289_14"></div>
</div></div><div class="fbNubFlyoutFooter">
<div class="uiTypeahead uiClearableTypeahead fbChatTypeahead" id="u971289_15">
<div class="wrap"><label class="clear uiCloseButton" for="u971291_21">
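
For Python 3, where str has no decode() method, the same idea might look like the sketch below. The helper name unescape_markup and the one-line sample are illustrative and not part of the original answer. It rewrites \/ first, because the unicode_escape codec does not recognise that escape, and then round-trips through bytes so the codec can expand \u003c and \".

```python
def unescape_markup(text):
    """Undo JavaScript-style escapes in scraped markup (Python 3)."""
    # Handle backslash-slash up front: it is not a Python escape,
    # so the codec would not turn it into a plain slash.
    text = text.replace("\\/", "/")
    # backslashreplace turns any non-Latin-1 character into a \uXXXX escape,
    # which unicode_escape then decodes together with the original escapes.
    return text.encode("latin-1", "backslashreplace").decode("unicode_escape")


# Hypothetical one-line sample modelled on the output in the question.
raw = r'\u003cdiv class=\"pas status fcg\">Loading...\u003c\/div>'
print(unescape_markup(raw))
```

The result can then be passed to BeautifulSoup in place of the raw text.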

I hope this helps. If not, please improve the information you give in your question.

EDIT: the suggested answer now changes \/ to / as well.
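
The escapes in the sample (\u003c, \" and \/) are all valid JSON string escapes, so if the fragment really sits inside a JSON or JavaScript string in the page (an assumption; the question does not confirm it), a JSON decoder can undo all three in one step. The sketch below is Python 3; decode_js_string and ClassCollector are made-up names, and the standard-library HTMLParser only stands in for BeautifulSoup's findAll() to show that the tags become visible to a parser once decoded.

```python
import json
from html.parser import HTMLParser


def decode_js_string(fragment):
    """Decode the body of a JSON string literal given without its quotes."""
    # strict=False accepts literal newlines inside the fragment.
    return json.loads('"' + fragment + '"', strict=False)


class ClassCollector(HTMLParser):
    """Collect (tag, class) pairs from every start tag that has a class."""

    def __init__(self):
        super().__init__()
        self.classes = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class":
                self.classes.append((tag, value))


# Last line of the sample from the question, written with single backslashes.
fragment = r'\u003cdiv class=\"wrap\">\u003clabel class=\"clear uiCloseButton\" for=\"u971291_21\">'
markup = decode_js_string(fragment)

collector = ClassCollector()
collector.feed(markup)
print(markup)
print(collector.classes)
```

This only works while every double quote in the fragment is escaped, as it is in the sample; an unescaped quote would end the JSON string early and raise an error.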

