Retrieving pages from What.cd

Posted on 2024-12-09 12:00:35, 1035 words, 1 view, 0 comments

I'm writing a screen scraper for what.cd in Python, using BeautifulSoup. I came across this script while working and decided to look at it, since it seems similar to what I'm building. However, every time I run the script I get a message that my credentials are wrong, even though they are correct.

As far as I can tell, I'm getting this message because when the script tries to log into what.cd, what.cd is supposed to return a cookie containing the information that lets me request pages later in the script. So where the script is failing is:

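import cookielib
import urllib
import urllib2

# BeautifulSoup, username and password are defined earlier in the script.
# Build an opener that stores cookies from responses in cj.
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
login_data = urllib.urlencode({'username' : username,
                               'password' : password})
# Passing data makes this a POST to the login page.
check = opener.open('http://what.cd/login.php', login_data)
soup = BeautifulSoup(check.read())
# The script treats any <span class="warning"> as a failed login.
warning = soup.find('span', 'warning')
if warning:
    exit(str(warning)+'\n\nprobably means username or pw is wrong')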
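
One way to test the cookie theory is to look inside the cookie jar right after the login request. The sketch below is written for Python 3, where http.cookiejar replaces cookielib. The helper names describe_cookies and has_cookie are hypothetical, and nothing here is specific to what.cd.

```python
import http.cookiejar


def describe_cookies(jar):
    """Return one 'name=value; domain=...; path=...' string per stored cookie."""
    return [
        "%s=%s; domain=%s; path=%s" % (c.name, c.value, c.domain, c.path)
        for c in jar  # iterating a CookieJar yields Cookie objects
    ]


def has_cookie(jar, name):
    """Return True if the jar holds at least one cookie with this name."""
    return any(c.name == name for c in jar)
```

Calling describe_cookies(cj) after opener.open(...) returns an empty list if the server never set a cookie, which would support the theory. A check such as has_cookie(cj, "session") only makes sense once you know the real cookie name; "session" here is an assumed placeholder.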

I've tried several ways of authenticating with the site, including cookielib's FileCookieJar, the script located here, and the Requests module. Each one returns the same HTML page. In short, it says that "Javascript is disabled" and "Cookies are disabled", and it also includes an HTML login box.

I don't really want to mess around with Mechanize, but I don't see any other way to do it at the moment. If anyone can provide any help, it would be greatly appreciated.

Comments (1)

带刺的爱情 2024-12-16 12:00:35

After a few more hours of searching, I found the solution to my problem. I'm still not sure why this code works as opposed to the version above, but it does. Here is the code I'm using now:

import urllib
import urllib2
import cookielib

# Install a global opener that stores cookies in cj and resends them.
cj = cookielib.LWPCookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)

# Fetch the index page first; any cookies the server sets end up in cj.
request = urllib2.Request("http://what.cd/index.php", None)
f = urllib2.urlopen(request)
f.close()

# POST the credentials; cookies collected above are sent along.
data = urllib.urlencode({"username": "your-login", "password" : "your-password"})
request = urllib2.Request("http://what.cd/login.php", data)
f = urllib2.urlopen(request)

html = f.read()
f.close()

Credit goes to carl.waldbieser from linuxquestions.org. Thanks to everyone who gave input.
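
For anyone on Python 3, the same two-step flow (GET the index page, then POST the credentials through the same cookie-aware opener) might look roughly like the sketch below. It is a port under assumptions, not tested against the site: urllib.request and http.cookiejar stand in for urllib2 and cookielib, the function names make_opener, build_login_request and login are hypothetical, and the form field names are simply the ones used above. A plausible but unconfirmed reason the extra GET matters is that the site sets a cookie on the first visit and expects it back with the login.

```python
import http.cookiejar
import urllib.parse
import urllib.request

BASE_URL = "http://what.cd"  # from the post; the site may not be reachable


def make_opener():
    """Return (opener, jar); the opener stores and resends cookies via jar."""
    jar = http.cookiejar.LWPCookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def build_login_request(username, password, base_url=BASE_URL):
    """Build the POST request for login.php without sending it."""
    data = urllib.parse.urlencode(
        {"username": username, "password": password}
    ).encode("ascii")  # urlencode output is already percent-encoded ASCII
    return urllib.request.Request(base_url + "/login.php", data=data)


def login(username, password, base_url=BASE_URL):
    """Visit the index page, then log in with the same opener."""
    opener, jar = make_opener()
    # Step 1: plain GET so any initial cookies land in the jar.
    with opener.open(base_url + "/index.php") as resp:
        resp.read()
    # Step 2: POST the credentials; the jar's cookies are sent automatically.
    with opener.open(build_login_request(username, password, base_url)) as resp:
        html = resp.read()
    return html, jar
```

Unlike the original, this sketch does not call install_opener, so the cookie handling stays local to the opener returned by make_opener rather than changing urlopen globally.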
