How can I perform a search programmatically without using an API?
I would like to create a program that will enter a string into the text box on a site like Google (without using their public API) and then submit the form and grab the results. Is this possible? Grabbing the results will require the use of HTML scraping I would assume, but how would I enter data into the text field and submit the form? Would I be forced to use a public API? Is something like this just not feasible? Would I have to figure out query strings/parameters?
Thanks
Comments (5)
Theory
What I would do is create a little program that can automatically submit any form data to any place and come back with the results. This is easy to do in Java with HTTPUnit. The task goes like this:
The solution you pick will depend on a variety of factors, including:
For example, you could try the following applications to submit the data for you:
Then grep (awk, or sed) the resulting web page(s).
Another trick when screen scraping is to download a sample HTML file and parse it manually in vi (or VIM). Save the keystrokes to a file and then whenever you run the query, apply those keystrokes to the resulting web page(s) to extract the data. This solution is not maintainable, nor 100% reliable (but screen scraping from a website seldom is). It works and is fast.
Example
A semi-generic Java class to submit website forms (specifically dealing with logging into a website) is below, in the hopes that it might be useful. Do not use it for evil.
An example properties file would look like:
Run it similar to the following (substitute the path to HTTPUnit and the FormElements class for $CLASSPATH):
Legality
Another answer mentioned that it might violate terms of use. Check into that first, before you spend any time looking into a technical solution. Extremely good advice.
Most of the time, you can just send a simple HTTP POST request.
I'd suggest you try playing around with Fiddler to understand how the web works.
Nearly all the programming languages and frameworks out there have methods for sending raw requests.
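As one illustration of sending such a request, here is a minimal sketch in Python using only the standard library. The URL and field names in the usage comment are placeholders, not taken from any real site; the actual field names have to be read from the target form's HTML or from a captured request (for example in Fiddler).

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_post_request(url, fields):
    """Build (but do not send) a form-style POST request.

    `fields` maps form field names to values, as a browser would
    submit them for a plain application/x-www-form-urlencoded form.
    """
    body = urlencode(fields).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def submit_form(url, fields, timeout=10):
    """Send the POST request and return the response body as text."""
    with urlopen(build_post_request(url, fields), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Hypothetical usage (placeholder URL and field names):
# html = submit_form("https://example.com/search", {"q": "foobar"})
```

Building the request separately from sending it makes it easy to inspect exactly what would go over the wire before contacting the site.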
And you can always program against the Internet Explorer ActiveX control. I believe many programming languages support it.
I believe this would put you in violation of the terms of use (consult a lawyer about that: programmers are not good at giving legal advice!), but, technically, you could search for foobar by just visiting the URL http://www.google.com/search?q=foobar and, as you say, scraping the resulting HTML. You'll probably also need to fake the User-Agent HTTP header and maybe some others. Maybe there are search engines whose terms of use do not forbid this; you and your lawyer might be well advised to look around to see if this is indeed the case.
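A minimal Python sketch of the query-string approach described above, using only the standard library. The `q` parameter and base URL come from the URL in this answer; the default User-Agent string is an arbitrary placeholder, and whether a given site accepts such requests (technically or under its terms of use) is not something this sketch can guarantee.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_search_request(
    query,
    base_url="http://www.google.com/search",
    user_agent="Mozilla/5.0 (compatible; example-script)",  # placeholder value
):
    """Build (but do not send) a GET request with the query in the URL."""
    # urlencode takes care of escaping spaces and special characters.
    url = base_url + "?" + urlencode({"q": query})
    return Request(url, headers={"User-Agent": user_agent})


def fetch_results(query, timeout=10):
    """Send the request and return the raw result page as text."""
    with urlopen(build_search_request(query), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Usage: html = fetch_results("foobar")
```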
Well, here's the HTML from the Google page:
If you know how to make an HTTP request from your favorite programming language, just give it a try and see what you get back. Try this for instance:
If you download Cygwin and add Cygwin\bin to your path, you can use curl to retrieve a page and grep, sed, or a similar tool to parse the results. Why fill out the form when, with Google, you can use the query string parameters anyway? With curl you can also post data, set headers, and so on. I use it to call web services from the command line.
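If you would rather do the "parse the results" step in a script than with grep or sed, the following Python sketch pulls link targets out of a page that has already been downloaded. It is not the curl pipeline described above, just a stand-in for the parsing half using the standard library's `html.parser`; which links actually correspond to search results depends on the site's markup and is not handled here.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; anchors without
        # an href (e.g. named anchors) are skipped.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Return all link targets found in an HTML string, in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Usage: links = extract_links(page_text)
```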