Best way to grab website content from the outside

Posted on 2024-07-25 04:46:32

There's a search site whose results are generated dynamically by JavaScript. The user enters a query, and the site displays the content on the page without refreshing.

I need to grab those search results programmatically (say, from a Java program or a Perl/Python script).

Ideally, I could launch my program with 100 queries as input, and the program would hit the website with each query and print to my screen all the search results the website returns.

The obvious problem is that the site is built with JavaScript rather than plain HTML, so sending a URL request and parsing the resulting output is not going to work (the source code for the page is always just a bunch of references to various .js files).

Given the above conditions, what are my options?

Comments (3)

何止钟意 2024-08-01 04:46:32

Unless the search provider gives you an API to work with (either via a backchannel agreement or a publicly available one), nothing you do is likely to work for very long.

You may go to great pains to fool the site into believing you are an ordinary website user. Then they will make some minor change to how their site works (because they have no idea someone is using it the way you are), and all of a sudden your hack won't work. Some time later, they may notice how you are using them, detect your usage, and flat out block it.

Basically, unless they give you an API, you are essentially stealing, and should expect to receive all the courtesy that deserves... none.

Lest you think I am judging you, I'll let you know I speak from experience ;)

魂归处 2024-08-01 04:46:32

JavaScript makes HTTP requests almost exactly like a browser does. Once you figure out what they are, you can try to re-create them in Perl/Python/etc.
With Firefox+Firebug you can see the requests in the 'Net' panel.

Things you might have to take into account are the user-agent string, cookies, the fact that the returned data is sometimes meant to be run/interpreted by JavaScript, etc. Maybe your language of choice has a nice HTTP browser class you can use?
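For the user-agent and cookie part, here is a minimal sketch using only Python's standard library. The USER_AGENT value and the function name make_browser_like_opener are made up for illustration; you would copy the real header from the request your own browser sends, and the site may well check other headers too.

```python
import http.cookiejar
import urllib.request

# Hypothetical value: copy the real string from your browser's request headers.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) ExampleClient/0.1"


def make_browser_like_opener(user_agent=USER_AGENT):
    """Return (opener, cookie_jar).

    The opener sends the given User-Agent on every request and stores any
    cookies the server sets, resending them on later requests.
    """
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    # Replaces the default "Python-urllib/x.y" User-Agent header.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


if __name__ == "__main__":
    opener, jar = make_browser_like_opener()
    # No request is sent here; opener.open(url) would perform one.
    print(opener.addheaders)
```

Calling opener.open(url) once on the search page before issuing the actual search request gives the jar a chance to pick up any session cookie, which is roughly what a browser would do.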


I just took a look: searching for IBM, I took the POST data from Firebug, replaced the newlines with '&', and put it after the request URL:

[http://bcode.bloomberg.com/sym/dwr/call/plaincall/searchMgr.search.dwr?callCount=1&windowName=&c0-scriptName=searchMgr&c0-methodName=search&c0-id=0&c0-e1=string:ibm&c0-e2=string:&c0-e3=number:100&c0-e4=number:0&c0-e5=boolean:false&c0-param0=Object_SearchCriteria:{search:reference:c0-e1,%20filter:reference:c0-e2,%20limit:reference:c0-e3&,%20start:reference:c0-e4,%20allSources:reference:c0-e5}&batchId=4&page=%2Fsym%2F&httpSessionId=&scriptSessionId=FBC68693A4E1BC08D6E0DDFBDF6D0860]

but it returns

throw 'allowScriptTagRemoting is false.';
//#DWR-REPLY
if (window.dwr) dwr.engine.remote.handleBatchException({ name:'java.lang.SecurityException', message:'GET Disallowed' });
else if (window.parent.dwr) window.parent.dwr.engine.remote.handleBatchException({ name:'java.lang.SecurityException', message:'GET Disallowed' });

and no data. So it looks like you have to script a POST request. Looking at their restrictions and guidelines, maybe you should just get in touch and ask if there's a public API?
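A rough sketch of what scripting that POST could look like in Python. It assumes the body is the same key=value lines Firebug showed, one per line, and that the server accepts a text/plain content type; both are guesses on my part and not something the response above confirms. Only the first few fields from the URL are listed, and the helper names are made up. The remaining lines should be copied verbatim from the Firebug 'Net' panel.

```python
import urllib.request

# Endpoint taken from the URL above, without the query string.
DWR_URL = ("http://bcode.bloomberg.com/sym/dwr/call/plaincall/"
           "searchMgr.search.dwr")


def fields_for_query(query):
    """Partial field list; copy the rest verbatim from Firebug."""
    return [
        ("callCount", "1"),
        ("windowName", ""),
        ("c0-scriptName", "searchMgr"),
        ("c0-methodName", "search"),
        ("c0-id", "0"),
        ("c0-e1", "string:" + query),
    ]


def build_post_request(url, fields):
    """Build (but do not send) a POST whose body has one key=value per line."""
    body = "\n".join(f"{key}={value}" for key, value in fields) + "\n"
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "text/plain"},  # assumption, see above
        method="POST",
    )


if __name__ == "__main__":
    req = build_post_request(DWR_URL, fields_for_query("ibm"))
    print(req.get_method())
    print(req.data.decode("utf-8"))
    # urllib.request.urlopen(req) would actually send it.
```

The session IDs in the captured request (scriptSessionId and friends) probably expire, so expect to fetch fresh ones before this works, if it works at all.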

那一片橙海, 2024-08-01 04:46:32

Install Firebug, study the requests made by the site's JavaScript, and mimic them in your program. Chances are there is a single request that needs to be made, and the results will come back in some nice form like JSON.
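If it does turn out to be a single request returning JSON, the whole program could be as small as the sketch below. Everything specific here is hypothetical: the SEARCH_URL, the "q" parameter and the "results" key are placeholders for whatever Firebug actually shows for the real site.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint; replace with the request seen in Firebug.
SEARCH_URL = "http://example.com/search"


def build_url(query):
    """Encode the query into the (assumed) 'q' URL parameter."""
    return SEARCH_URL + "?" + urllib.parse.urlencode({"q": query})


def fetch(url):
    """Download a URL and return the body as text."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8")


def search_all(queries, fetcher=fetch):
    """Run every query and map it to the list under the (assumed) 'results' key."""
    results = {}
    for query in queries:
        payload = json.loads(fetcher(build_url(query)))
        results[query] = payload.get("results", [])
    return results


if __name__ == "__main__":
    # A canned fetcher so the sketch runs without touching the network.
    def fake_fetcher(url):
        return json.dumps({"results": ["hit for " + url]})

    for query, hits in search_all(["ibm", "msft"], fetcher=fake_fetcher).items():
        print(query, hits)
```

Passing the fetcher in as a parameter makes it easy to try the parsing against a saved response first, before sending 100 real queries at the site.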
