Scraping form data with Python
I'm looking to grab the form data that needs to be passed along to a specific website and submit it. Below is the HTML (form only) that I need to simulate. I've been working on this for a few hours but can't get anything to work. I want this to run on Google App Engine. Any help would be appreciated.
<form method="post" action="/member/index.bv">
<table cellspacing="0" cellpadding="0" border="0" width="100%">
<tr>
<td align="left">
<h3>member login</h3><input type="hidden" name="submit" value="login" /><br />
</td>
</tr>
<tr>
<td align="left" style="color: #8b6c46;">
email:<br />
<input type="text" name="email" style="width: 140px;" />
</td>
</tr>
<tr>
<td align="left" style="color: #8b6c46;">
password:<br />
<input type="password" name="password" style="width: 140px;" />
</td>
</tr>
<tr>
<td>
<input type="image" class="formElementImageButton" src="/resources/default/images/btnLogin.gif" style="width: 46px; height: 17px;" />
</td>
</tr>
<tr>
<td align="left">
<div style="line-height: 1.5em;">
<a href="/join/" style="color: #8b6c46; font-weight: bold; text-decoration: underline; ">join</a><br />
<a href="/member/forgot/" style="color: #8b6c46; font-weight: bold; text-decoration: underline;">forgot password?</a><input type="hidden" name="lastplace" value="%2F"><br />
having trouble logging on, <a href="/cookieProblems.bv">click here</a> for help
</div>
</td>
</tr>
</table>
</form>
Currently I'm trying to use this code to access it, but it's not working. I'm pretty new to this, so maybe I'm just missing something.
import urllib2, urllib
url = 'http://blah.com/member/index.bv'
values = {'email': '[email protected]',
          'password': 'somepassword'}
data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
the_page = response.read()
Comments (2)
Is this login page for a 3rd party site? If so, there may be more to it than simply posting the form inputs.
For example, I just tried this with the login page on one of my own sites. A simple post request won't work in my case, and this may be the same with the login page you are accessing as well.
For starters, the login form may have a hidden CSRF token value that you have to send when posting your login request. This means you'd have to first GET the login page and parse the resulting HTML for the CSRF token value. The server may also require its session cookie in the login request. I'm using the requests module to handle the GET/POST and BeautifulSoup to parse the data.
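A rough sketch of the flow described above: GET the login page, collect its hidden inputs, then POST them back with the credentials over the same cookie-aware session. This is not the code the answer itself used. It swaps in Python 3's standard library (`html.parser`, `http.cookiejar`, `urllib.request`) for requests and BeautifulSoup so that it has no third-party dependencies. The names `HiddenInputParser`, `hidden_fields` and `login` are made up for illustration, the `email`/`password` field names are taken from the form in the question, and a real site may require more than this.

```python
from html.parser import HTMLParser
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, build_opener


class HiddenInputParser(HTMLParser):
    """Collect the name/value pair of every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_hidden = (attrs.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


def hidden_fields(html):
    """Return a dict of the hidden form fields found in an HTML string."""
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.fields


def login(login_page_url, post_url, email, password):
    """Hypothetical helper: fetch the login page, then post the form back."""
    # One opener for both requests, so any session cookie set by the GET
    # is sent back automatically with the POST.
    opener = build_opener(HTTPCookieProcessor(CookieJar()))
    page = opener.open(login_page_url).read().decode("utf-8", "replace")

    # Start from the hidden inputs (a CSRF token, if the site uses one,
    # would normally be among them), then add the visible fields.
    fields = hidden_fields(page)
    fields.update({"email": email, "password": password})

    data = urlencode(fields).encode("ascii")
    return opener.open(post_url, data).read()  # passing data makes it a POST
```

Collecting every hidden input, rather than looking for one token by name, avoids guessing what the site calls its token field. Only `hidden_fields` can be checked offline; `login` needs a real site to talk to.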
I cannot say if GAE will allow any of this or not, but at least it might be helpful in figuring out what you may require in your particular case. Also, as Carl points out, if a submit input is used to trigger the post you'd have to include it. In my particular example, this isn't required.
You're missing the hidden submit=login argument. Have you tried including it in the POST data?
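A minimal sketch of that suggestion, written with Python 3's `urllib.parse` and `urllib.request` rather than the question's Python 2 `urllib`/`urllib2`. The `submit` and `lastplace` pairs are copied from the hidden inputs in the question's form. The email address is a placeholder, and whether the server additionally expects a session cookie cannot be told from the form alone.

```python
from urllib.parse import urlencode
from urllib.request import Request

url = "http://blah.com/member/index.bv"

values = {
    "email": "user@example.com",  # placeholder address, not from the question
    "password": "somepassword",
    "submit": "login",            # hidden input in the form
    "lastplace": "%2F",           # hidden input in the form, sent as-is
}

# urlencode percent-encodes each value; the request body must be bytes.
data = urlencode(values).encode("ascii")

# Supplying a body makes this a POST request.
req = Request(url, data)

# Sending it would then be:
#   from urllib.request import urlopen
#   the_page = urlopen(req).read()
```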