How do I programmatically perform a search without using an API?

Posted 2024-07-27 22:27:42, 183 words, 10 views, 0 comments

I would like to create a program that will enter a string into the text box on a site like Google (without using their public API) and then submit the form and grab the results. Is this possible? Grabbing the results will require the use of HTML scraping I would assume, but how would I enter data into the text field and submit the form? Would I be forced to use a public API? Is something like this just not feasible? Would I have to figure out query strings/parameters?

Thanks

Comments (5)

杯别 2024-08-03 22:27:43

Theory

What I would do is create a little program that can automatically submit any form data to any place and come back with the results. This is easy to do in Java with HTTPUnit. The task goes like this:

  • Connect to the web server.
  • Parse the page.
  • Get the first form on the page.
  • Fill in the form data.
  • Submit the form.
  • Read (and parse) the results.

The solution you pick will depend on a variety of factors, including:

  • Whether you need to emulate JavaScript
  • What you need to do with the data afterwards
  • Which languages you are proficient in
  • Application speed (is this for one query or 100,000?)
  • How soon the application needs to be working
  • Whether it is a one-off or will have to be maintained

For example, you could try the following applications to submit the data for you:

Then grep (awk, or sed) the resulting web page(s).

Another trick when screen scraping is to download a sample HTML file and parse it manually in vi (or VIM). Save the keystrokes to a file and then whenever you run the query, apply those keystrokes to the resulting web page(s) to extract the data. This solution is not maintainable, nor 100% reliable (but screen scraping from a website seldom is). It works and is fast.

Example

A semi-generic Java class to submit website forms (specifically dealing with logging into a website) is below, in the hopes that it might be useful. Do not use it for evil.

import java.io.FileInputStream;

import java.util.Enumeration;
import java.util.Hashtable;  
import java.util.Properties; 

import com.meterware.httpunit.GetMethodWebRequest;
import com.meterware.httpunit.WebConversation;
import com.meterware.httpunit.WebForm;
import com.meterware.httpunit.WebResponse;

public class FormElements extends Properties
{                                           
  private static final String FORM_URL = "form.url";
  private static final String FORM_ACTION = "form.action";

  /** These are properly provided property parameters. */
  private static final String FORM_PARAM = "form.param.";

  /** These are property parameters that are required; must have values. */
  private static final String FORM_REQUIRED = "form.required.";            

  private Hashtable fields = new Hashtable( 10 );

  private WebConversation webConversation;

  public FormElements()
  {                    
  }                    

  /**
   * Retrieves the HTML page, populates the form data, then sends the
   * information to the server.                                      
   */                                                                
  public void run()                                                  
    throws Exception                                                 
  {                                                                  
    WebResponse response = receive();                                
    WebForm form = getWebForm( response );                           

    populate( form );

    form.submit();
  }               

  protected WebResponse receive()
    throws Exception             
  {                              
    WebConversation webConversation = getWebConversation();
    GetMethodWebRequest request = getGetMethodWebRequest();

    // Fake the User-Agent so the site thinks that encryption is supported.
    //
    request.setHeaderField( "User-Agent",
      "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.3) Gecko/20040913" );

    return webConversation.getResponse( request );
  }                                               

  protected void populate( WebForm form )
    throws Exception                     
  {                                      
    // First set all the .param variables.
    //                                    
    setParamVariables( form );            

    // Next, set the required variables.
    //                                  
    setRequiredVariables( form );       
  }                                     

  protected void setParamVariables( WebForm form )
    throws Exception                              
  {                                               
    for( Enumeration e = propertyNames(); e.hasMoreElements(); )
    {                                                           
      String property = (String)(e.nextElement());              

      if( property.startsWith( FORM_PARAM ) )
      {                                      
        String fieldName = getProperty( property );
        String propertyName = property.substring( FORM_PARAM.length() );
        String fieldValue = getField( propertyName );                   

        // Skip blank fields (most likely, this is a blank last name, which
        // means the form wants a full name).                              
        //                                                                 
        if( "".equals( fieldName ) )                                       
          continue;                                                        

        // If this is the first name, and the last name parameter is blank,
        // then append the last name field to the first name field.        
        //                                                                 
        if( "first_name".equals( propertyName ) &&                         
            "".equals( getProperty( FORM_PARAM + "last_name" ) ) )         
          fieldValue += " " + getField( "last_name" );                     

        showSet( fieldName, fieldValue );
        form.setParameter( fieldName, fieldValue );
      }                                            
    }                                              
  }                                                

  protected void setRequiredVariables( WebForm form )
    throws Exception                                 
  {                                                  
    for( Enumeration e = propertyNames(); e.hasMoreElements(); )
    {                                                           
      String property = (String)(e.nextElement());              

      if( property.startsWith( FORM_REQUIRED ) )
      {                                         
        String fieldValue = getProperty( property );
        String fieldName = property.substring( FORM_REQUIRED.length() );

        // If the field starts with a ~, then copy the field.
        //                                                   
        if( fieldValue.startsWith( "~" ) )                   
        {                                                    
          String copyProp = fieldValue.substring( 1, fieldValue.length() );
          copyProp = getProperty( copyProp );                              

          // Since the parameters have been copied into the form, we can   
          // eke out the duplicate values.                                 
          //                                                               
          fieldValue = form.getParameterValue( copyProp );                 
        }                                                                  

        showSet( fieldName, fieldValue );
        form.setParameter( fieldName, fieldValue );
      }                                            
    }                                              
  }                                                

  private void showSet( String fieldName, String fieldValue )
  {                                                          
    System.out.print( "<p class='setting'>" );               
    System.out.print( fieldName );                           
    System.out.print( " = " );                               
    System.out.print( fieldValue );                          
    System.out.println( "</p>" );                            
  }                                                          

  private WebForm getWebForm( WebResponse response )
    throws Exception                                
  {                                                 
    WebForm[] forms = response.getForms();          
    String action = getProperty( FORM_ACTION );     

    // Not supposed to break out of a for-loop, but it makes the code easy ...
    //                                                                        
    for( int i = forms.length - 1; i >= 0; i-- )                              
      if( forms[ i ].getAction().equalsIgnoreCase( action ) )                 
        return forms[ i ];                                                    

    // Sadly, no form was found.
    //
    throw new Exception( "No form with action '" + action + "' was found." );
  }                             

  private GetMethodWebRequest getGetMethodWebRequest()
  {
    return new GetMethodWebRequest( getProperty( FORM_URL ) );
  }

  private WebConversation getWebConversation()
  {
    if( this.webConversation == null )
      this.webConversation = new WebConversation();

    return this.webConversation;
  }

  public void setField( String field, String value )
  {
    Hashtable fields = getFields();
    fields.put( field, value );
  }

  private String getField( String field )
  {
    Hashtable<String, String> fields = getFields();
    String result = fields.get( field );

    return result == null ? "" : result;
  }

  private Hashtable getFields()
  {
    return this.fields;
  }

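  public static void main( String args[] )
    throws Exception
  {
    // Expect the properties file followed by the four field values.
    //
    if( args.length < 5 )
    {
      System.err.println(
        "Usage: FormElements <properties> <first> <last> <email> <comments>" );
      return;
    }

    FormElements formElements = new FormElements();

    formElements.setField( "first_name", args[1] );
    formElements.setField( "last_name", args[2] );
    formElements.setField( "email", args[3] );
    formElements.setField( "comments", args[4] );

    // Close the properties file even if loading fails.
    //
    try( FileInputStream fis = new FileInputStream( args[0] ) )
    {
      formElements.load( fis );
    }

    formElements.run();
  }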
}

An example properties file would look like:

$ cat com.mellon.properties

form.url=https://www.mellon.com/contact/index.cfm
form.action=index.cfm
form.param.first_name=name
form.param.last_name=
form.param.email=emailhome
form.param.comments=comments

# Submit Button
#form.submit=submit

# Required Fields
#
form.required.to=zzwebmaster
form.required.phone=555-555-1212
form.required.besttime=5 to 7pm

Run it with a command similar to the following (substituting the path to HTTPUnit and the FormElements class for $CLASSPATH):

java -cp $CLASSPATH FormElements com.mellon.properties "John" "Doe" "[email protected]" "To whom it may concern  ..."

Legality

Another answer mentioned that it might violate terms of use. Check into that first, before you spend any time looking into a technical solution. Extremely good advice.

土豪 2024-08-03 22:27:43

Most of the time, you can just send a simple HTTP POST request.

I'd suggest you try playing around with Fiddler to understand how the web works.

Nearly all the programming languages and frameworks out there have methods for sending raw requests.
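As a rough illustration of the "simple HTTP POST" idea, here is a minimal Java sketch using the standard java.net.http client (Java 11 or later). The class name FormPost and the field names "name" and "comments" are made up for the example; the real field names have to be read from the target form. The request is only sent when a URL is passed on the command line.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

final class FormPost
{
  /** Encodes the fields as an application/x-www-form-urlencoded body. */
  static String encode( Map<String, String> fields )
  {
    return fields.entrySet().stream()
      .map( e -> URLEncoder.encode( e.getKey(), StandardCharsets.UTF_8 )
        + "=" + URLEncoder.encode( e.getValue(), StandardCharsets.UTF_8 ) )
      .collect( Collectors.joining( "&" ) );
  }

  /** Builds (but does not send) a POST request carrying the fields. */
  static HttpRequest buildPost( String url, Map<String, String> fields )
  {
    return HttpRequest.newBuilder( URI.create( url ) )
      .header( "Content-Type", "application/x-www-form-urlencoded" )
      .POST( HttpRequest.BodyPublishers.ofString( encode( fields ) ) )
      .build();
  }

  public static void main( String[] args )
    throws Exception
  {
    // Hypothetical field names; use the ones from the real form.
    Map<String, String> fields = new LinkedHashMap<>();
    fields.put( "name", "John Doe" );
    fields.put( "comments", "To whom it may concern" );

    System.out.println( encode( fields ) );

    // Only touch the network when a target URL is given.
    if( args.length > 0 )
    {
      HttpResponse<String> response = HttpClient.newHttpClient()
        .send( buildPost( args[0], fields ),
               HttpResponse.BodyHandlers.ofString() );
      System.out.println( response.statusCode() );
    }
  }
}
```

A tool such as Fiddler shows the same encoded body when a browser submits the form, which is a convenient way to check that the field names match.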

And you can always program against the Internet Explorer ActiveX control. I believe many programming languages support it.

巴黎夜雨 2024-08-03 22:27:43

I believe this would put you in violation of the terms of use (consult a lawyer about that: programmers are not good at giving legal advice!), but, technically, you could search for foobar by just visiting the URL http://www.google.com/search?q=foobar and, as you say, scraping the resulting HTML. You'll probably also need to fake the User-Agent HTTP header and maybe some others.
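Purely as a technical sketch of that request, and assuming the java.net.http client from Java 11 or later, the following builds the search URL from the query and sets a User-Agent header. The class name SearchRequest and the User-Agent value are placeholders chosen for illustration. Nothing is fetched unless a query is passed on the command line, and the terms-of-use caveat applies before running it against any real site.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

final class SearchRequest
{
  /** Puts the URL-encoded query into the q parameter. */
  static URI searchUri( String query )
  {
    return URI.create( "http://www.google.com/search?q="
      + URLEncoder.encode( query, StandardCharsets.UTF_8 ) );
  }

  /** Builds (but does not send) a GET request with a custom User-Agent. */
  static HttpRequest build( String query, String userAgent )
  {
    return HttpRequest.newBuilder( searchUri( query ) )
      .header( "User-Agent", userAgent )
      .GET()
      .build();
  }

  public static void main( String[] args )
    throws Exception
  {
    // Placeholder User-Agent value, not a recommendation.
    String userAgent = "Mozilla/5.0 (X11; Linux x86_64)";

    if( args.length == 0 )
    {
      System.out.println( searchUri( "foobar" ) );
      return;
    }

    HttpResponse<String> response = HttpClient.newHttpClient()
      .send( build( args[0], userAgent ),
             HttpResponse.BodyHandlers.ofString() );

    System.out.println( response.statusCode() );
    System.out.println( response.body().length() + " characters of HTML" );
  }
}
```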

Maybe there are search engines whose terms of use do not forbid this; you and your lawyer might be well advised to look around to see if this is indeed the case.

葬心 2024-08-03 22:27:43

Well, here's the HTML from the Google page:

<form action="/search" name=f><table cellpadding=0 cellspacing=0><tr valign=top>
<td width=25%> </td><td align=center nowrap>
<input name=hl type=hidden value=en>
<input type=hidden name=ie value="ISO-8859-1">
<input autocomplete="off" maxlength=2048 name=q size=55 title="Google Search" value="">
<br>
<input name=btnG type=submit value="Google Search">
<input name=btnI type=submit value="I'm Feeling Lucky">
</td><td nowrap width=25% align=left>
<font size=-2>  <a href=/advanced_search?hl=en>
Advanced Search</a><br>  
<a href=/preferences?hl=en>Preferences</a><br>  
<a href=/language_tools?hl=en>Language Tools</a></font></td></tr></table>
</form>

If you know how to make an HTTP request from your favorite programming language, just give it a try and see what you get back. Try this for instance:

http://www.google.com/search?hl=en&q=Stack+Overflow
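
The parameter names in that URL (hl, q) are simply the name attributes of the form's input elements. As a small sketch of how they could be pulled out of a page, the hypothetical FormFields class below lists the input names with regular expressions. A regex is a shortcut that copes with the unquoted attributes shown above; it is not a substitute for a real HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FormFields
{
  private static final Pattern INPUT =
    Pattern.compile( "<input\\b[^>]*>", Pattern.CASE_INSENSITIVE );

  // Matches name=value with or without surrounding quotes.
  private static final Pattern NAME =
    Pattern.compile( "\\bname=[\"']?([^\\s\"'>]+)", Pattern.CASE_INSENSITIVE );

  /** Returns the name attribute of every input tag, in document order. */
  static List<String> inputNames( String html )
  {
    List<String> names = new ArrayList<>();
    Matcher input = INPUT.matcher( html );

    while( input.find() )
    {
      Matcher name = NAME.matcher( input.group() );

      if( name.find() )
        names.add( name.group( 1 ) );
    }

    return names;
  }

  public static void main( String[] args )
  {
    String html = "<form action=\"/search\" name=f>"
      + "<input name=hl type=hidden value=en>"
      + "<input autocomplete=\"off\" maxlength=2048 name=q size=55>"
      + "<input name=btnG type=submit value=\"Google Search\">"
      + "</form>";

    System.out.println( inputNames( html ) );
  }
}
```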

山川志 2024-08-03 22:27:43

If you download Cygwin and add Cygwin\bin to your path, you can use curl to retrieve a page and grep/sed/whatever to parse the results. Why fill out the form at all when, with Google, you can use the query-string parameters? With curl, you can also post data, set header information, and so on. I use it to call web services from the command line.
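
For anyone who would rather stay in one language than pipe curl into grep, the filtering step can be approximated in Java. This is only a sketch of the grep part, with a made-up class name Grep and sample text; it keeps the lines of a downloaded page that contain a match for a regular expression (String.lines needs Java 11 or later).

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class Grep
{
  /** Returns the lines of text that contain a match for regex. */
  static List<String> grep( String text, String regex )
  {
    return text.lines()
      .filter( Pattern.compile( regex ).asPredicate() )
      .collect( Collectors.toList() );
  }

  public static void main( String[] args )
  {
    // Sample page content; in practice this would be the fetched HTML.
    String page = "<p>intro</p>\n"
      + "<a href=\"/first\">First</a>\n"
      + "<a href=\"/second\">Second</a>\n";

    for( String line : grep( page, "href=" ) )
      System.out.println( line );
  }
}
```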
