Scraping form data with Python

Posted on 2024-12-05 06:12:53

I'm looking to grab the form data that needs to be passed along to a specific website and submit it. Below is the HTML (form only) that I need to simulate. I've been working on this for a few hours, but can't seem to get anything to work. I want this to work in Google App Engine. Any help would be appreciated.

<form method="post" action="/member/index.bv"> 
        <table cellspacing="0" cellpadding="0" border="0" width="100%"> 
            <tr> 
                <td align="left"> 
                    <h3>member login</h3><input type="hidden" name="submit" value="login" /><br /> 
                </td> 
            </tr> 
            <tr> 
                <td align="left" style="color: #8b6c46;"> 
                    email:<br /> 
                    <input type="text" name="email" style="width: 140px;" /> 
                </td> 
            </tr> 
            <tr> 
                <td align="left" style="color: #8b6c46;"> 
                    password:<br /> 
                    <input type="password" name="password" style="width: 140px;" /> 
                </td> 
            </tr>
            <tr> 
                <td> 
                    <input type="image" class="formElementImageButton" src="/resources/default/images/btnLogin.gif" style="width: 46px; height: 17px;" /> 
                </td> 
            </tr> 
            <tr> 
                <td align="left"> 
                    <div style="line-height: 1.5em;"> 
                        <a href="/join/" style="color: #8b6c46; font-weight: bold; text-decoration: underline; ">join</a><br /> 
                        <a href="/member/forgot/" style="color: #8b6c46; font-weight: bold; text-decoration: underline;">forgot password?</a><input type="hidden" name="lastplace" value="%2F"><br /> 
                        having trouble logging on, <a href="/cookieProblems.bv">click here</a> for help
                    </div> 
                </td> 
            </tr> 
        </table> 
    </form>

Currently I'm trying to use this code to access it, but it's not working. I'm pretty new to this, so maybe I'm just missing something.

import urllib2, urllib

url = 'http://blah.com/member/index.bv'
values = {'email' : '[email protected]',
          'password' : 'somepassword'}

data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
the_page = response.read()

Comments (2)

旧伤还要旧人安 2024-12-12 06:12:53

Is this the login page of a third-party site? If so, there may be more to it than simply posting the form inputs.

For example, I just tried this with the login page on one of my own sites. A simple POST request doesn't work in my case, and the same may be true of the login page you are accessing.

For starters, the login form may have a hidden CSRF token value that you have to send with your login request. This means you'd first have to GET the login page and parse the resulting HTML for the CSRF token value. The server may also require its session cookie in the login request.

I'm using the requests module to handle the GET/POST and BeautifulSoup to parse the data.

import requests
from bs4 import BeautifulSoup  # BeautifulSoup 4 package name

# first get the login page
response = requests.get('https://www.site.com')
# requests decompresses gzip/deflate bodies itself, so no manual zlib step is needed;
# a requests Response has no read() method, the decoded body is in response.text
html = response.text
# parse the html for the csrf token
soup = BeautifulSoup(html, 'html.parser')
csrf_token = soup.find(name='input', id='csrf_token')['value']

# now, submit the login data, including csrf token and the original cookie data
response = requests.post('https://www.site.com/login',
                         data={'csrf_token': csrf_token,
                               'username': 'username',
                               'password': 'ckrit'},
                         cookies=response.cookies)

login_result = response.text
print(login_result)

I cannot say if GAE will allow any of this or not, but at least it might be helpful in figuring out what you may require in your particular case. Also, as Carl points out, if a submit input is used to trigger the post you'd have to include it. In my particular example, this isn't required.
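
As a rough illustration of the "fetch the page, then reuse its hidden values" idea, here is a minimal sketch that collects every hidden input from a form and merges the credentials in. It assumes Python 3 and uses only the standard library's html.parser; the class name HiddenInputCollector, the helper hidden_fields, the trimmed sample markup and the example.com address are all made up for illustration, and whether this runs unchanged on GAE has not been checked.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <input ... /> are routed here as well.
        if tag != 'input':
            return
        attributes = dict(attrs)
        input_type = (attributes.get('type') or '').lower()
        if input_type == 'hidden' and attributes.get('name'):
            self.fields[attributes['name']] = attributes.get('value') or ''


def hidden_fields(html):
    """Return a dict of the hidden form fields found in the given HTML."""
    collector = HiddenInputCollector()
    collector.feed(html)
    return collector.fields


# Trimmed-down version of the form from the question (hypothetical sample).
sample = '''
<form method="post" action="/member/index.bv">
  <input type="hidden" name="submit" value="login" />
  <input type="text" name="email" />
  <input type="password" name="password" />
  <input type="hidden" name="lastplace" value="%2F">
</form>
'''

# Start from the hidden fields, then add the values a user would type in.
payload = hidden_fields(sample)
payload.update({'email': 'user@example.com', 'password': 'somepassword'})
print(payload)
```

Building the payload from the page itself means a newly added hidden field (a CSRF token, for instance) is picked up without changing the code, at the cost of one extra GET before the POST.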

装迷糊 2024-12-12 06:12:53

You're missing the hidden submit=login argument. Have you tried:

import urllib2, urllib

url = 'http://blah.com/member/index.bv'
values = {'submit':'login',
          'email' : '[email protected]',
          'password' : 'somepassword'}

data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
the_page = response.read()
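
The form in the question also carries a second hidden field, lastplace, with the value %2F, which a browser would submit too. The sketch below shows what the encoded body could look like with both hidden fields included. It is an assumption-laden example: it uses Python 3's urllib.parse and urllib.request rather than the Python 2 urllib2 used in this thread, the example.com address is made up, and the request is only built, not sent. Whether the server actually requires lastplace is not known.

```python
from urllib.parse import urlencode
from urllib.request import Request

url = 'http://blah.com/member/index.bv'  # placeholder URL from the question

values = {
    'submit': 'login',             # hidden field from the form
    'email': 'user@example.com',   # hypothetical address for illustration
    'password': 'somepassword',
    'lastplace': '%2F',            # hidden field; the literal text "%2F"
}

# urlencode percent-encodes the literal "%" as well, so "%2F" becomes "%252F",
# which matches how a browser would encode that attribute value.
data = urlencode(values).encode('ascii')

# Supplying a body makes urllib treat the request as a POST.
req = Request(url, data=data)
print(req.get_method())
print(data.decode('ascii'))
```

Passing req to urllib.request.urlopen would send it; that step is left out here because the URL is only a placeholder.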