I'm writing a screen scraper for what.cd in Python, using BeautifulSoup. While working on it I came across this script and decided to look at it, since it seems similar to what I'm building. However, every time I run the script it tells me my credentials are wrong, even though they are correct.
As far as I can tell, I'm getting this message because the login step is failing: when the script logs in to what.cd, the site is supposed to return a cookie containing the information that lets me request pages later in the script. The script fails here:
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
login_data = urllib.urlencode({'username': username,
                               'password': password})
check = opener.open('http://what.cd/login.php', login_data)
soup = BeautifulSoup(check.read())
warning = soup.find('span', 'warning')
if warning:
    exit(str(warning) + '\n\nprobably means username or pw is wrong')
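One way to check the cookie theory is to look at the jar directly instead of relying only on the warning span. The following is a minimal sketch, not part of the original script: it uses the Python 3 module names (`http.cookiejar`, `urllib.request`) rather than the Python 2 `cookielib` and `urllib2` shown above, and `FakeResponse`, the `session` cookie, and the example.com URLs are hypothetical stand-ins so it runs without touching the network. It walks through the two steps that `HTTPCookieProcessor` performs: storing the cookies a login response sets, then attaching them to the next request.

```python
import email.message
import http.cookiejar
import urllib.request


class FakeResponse:
    """Hypothetical stand-in for an HTTP response, so no network is needed."""

    def __init__(self, set_cookie_values):
        self._headers = email.message.Message()
        for value in set_cookie_values:
            # Assigning the same header name repeatedly appends another header.
            self._headers["Set-Cookie"] = value

    def info(self):
        # CookieJar.extract_cookies() reads the headers through info().
        return self._headers


def cookie_header_after_login(set_cookie_values, login_url, next_url):
    """Return (names of stored cookies, Cookie header sent on the next request)."""
    jar = http.cookiejar.CookieJar()

    # Step 1: store whatever the login response set.
    login_request = urllib.request.Request(login_url, data=b"username=u&password=p")
    jar.extract_cookies(FakeResponse(set_cookie_values), login_request)

    # Step 2: attach the stored cookies to a later request.
    next_request = urllib.request.Request(next_url)
    jar.add_cookie_header(next_request)

    names = [cookie.name for cookie in jar]
    return names, next_request.get_header("Cookie")


# Hypothetical values; a real site chooses its own cookie names.
print(cookie_header_after_login(
    ["session=abc123; Path=/"],
    "http://example.com/login.php",
    "http://example.com/index.php",
))
```

In the real script, printing `[cookie.name for cookie in cj]` right after the `opener.open(...)` call gives the same information. If the list is empty, the login response set no cookie that the jar accepted, and every later request goes out without a `Cookie` header.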
I've tried several ways of authenticating with the site, including FileCookieJar, the script located here, and the Requests module. Each one gives me the same HTML response. In short, it says "Javascript is disabled" and "Cookies are disabled", and it also contains an HTML login box.
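A response that contains a login box usually means the site served its login page again rather than a logged-in page. The sketch below, which uses only the standard library's `html.parser`, checks a response for a password field and lists the names of all `input` fields, which can reveal fields the form expects beyond `username` and `password`. The sample HTML and the `token` field are hypothetical and are not taken from what.cd.

```python
from html.parser import HTMLParser


class LoginFormDetector(HTMLParser):
    """Collect input field names and note whether a password field is present."""

    def __init__(self):
        super().__init__()
        self.has_password_field = False
        self.field_names = []

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        if attributes.get("name"):
            self.field_names.append(attributes["name"])
        if (attributes.get("type") or "").lower() == "password":
            self.has_password_field = True


def inspect_login_page(html):
    """Return (looks like a login page, names of the form's input fields)."""
    detector = LoginFormDetector()
    detector.feed(html)
    return detector.has_password_field, detector.field_names


# Hypothetical response body, for illustration only.
sample = (
    '<form method="post" action="login.php">'
    '<input type="text" name="username">'
    '<input type="password" name="password">'
    '<input type="hidden" name="token" value="x">'
    '</form>'
)
print(inspect_login_page(sample))
```

Running this on the body returned by the login request shows whether the script is back on the login page and which field names the posted data would need to cover.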
I don't really want to mess around with Mechanize, but I don't see any other way to do it at the moment. If anyone can provide any help, it would be greatly appreciated.
Comments (1)
After a few more hours of searching, I found the solution to my problem. I'm still not sure why this code works as opposed to the version above, but it does. Here is the code I'm using now:
Credit goes to carl.waldbieser from linuxquestions.org. Thanks to everyone who gave input.