Scraping a dynamic website

Published 2024-07-07 13:06:12

What is the best method to scrape a dynamic website where most of the content is generated by what appear to be Ajax requests? I have previous experience with a Mechanize, BeautifulSoup, and Python combo, but I am up for something new.

--Edit--
For more detail: I'm trying to scrape the CNN primary database. There is a wealth of information there, but there doesn't appear to be an api.

Comments (8)

秋心╮凉 2024-07-14 13:06:12

The best solution that I found was to use Firebug to monitor XmlHttpRequests, and then to use a script to resend them.
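That replay step can be sketched in Python's standard library; the endpoint URL and headers below are placeholders for whatever Firebug's Net panel actually shows:

```python
import urllib.request

# Hypothetical XHR endpoint captured from Firebug's Net panel.
url = "https://example.com/ajax/results?page=1"

# Recreate the request the browser sent; X-Requested-With is the
# header most sites use to mark an XmlHttpRequest.
req = urllib.request.Request(url, headers={
    "X-Requested-With": "XMLHttpRequest",
    "User-Agent": "Mozilla/5.0 (compatible; scraper-sketch)",
})

# body = urllib.request.urlopen(req).read()  # send it when ready
```

The actual send is commented out here; once the captured headers and query parameters match what the browser sent, the server usually cannot tell the difference.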

只等公子 2024-07-14 13:06:12

This is a difficult problem because you either have to reverse engineer the JavaScript on a per-site basis, or implement a JavaScript engine and run the scripts (which has its own difficulties and pitfalls).

It's a heavyweight solution, but I've seen people do this with GreaseMonkey scripts: let Firefox render everything and run the JavaScript, then scrape the elements. You can even initiate user actions on the page if needed.

不…忘初心 2024-07-14 13:06:12

Selenium IDE, a tool for testing, is something I've used for a lot of screen-scraping. There are a few things it doesn't handle well (Javascript window.alert() and popup windows in general), but it does its work on a page by actually triggering the click events and typing into the text boxes. Because the IDE portion runs in Firefox, you don't have to do all of the management of sessions, etc. as Firefox takes care of it. The IDE records and plays tests back.

It also exports C#, PHP, Java, etc. code to build compiled tests/scrapers that are executed on the Selenium server. I've done that for more than a few of my Selenium scripts, which makes things like storing the scraped data in a database much easier.

Scripts are fairly simple to write and alter, being made up of things like ("clickAndWait","submitButton"). Worth a look given what you're describing.
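As a rough illustration of that command format, here is a toy Python interpreter for ("command", "target") pairs; the commands and the page model are invented, since real playback happens inside Selenium, not in code like this:

```python
# A toy interpreter for Selenese-style ("command", "target") pairs,
# just to show the shape of the scripts -- not a real Selenium runner.
def run_script(commands, page):
    log = []
    for command, target in commands:
        if command == "clickAndWait":
            # Record which element was clicked on the fake page.
            page["clicked"] = target
            log.append(f"clicked {target}")
        elif command == "type":
            # target is a (field, text) pair for typing into a box.
            field, text = target
            page[field] = text
            log.append(f"typed into {field}")
        else:
            raise ValueError(f"unknown command: {command}")
    return log

script = [
    ("type", ("searchBox", "primary results")),
    ("clickAndWait", "submitButton"),
]
page = {}
log = run_script(script, page)
```

The appeal of the format is exactly this flatness: a scrape is a list of small, named steps, which is why the exported C#/PHP/Java versions stay readable too.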

东北女汉子 2024-07-14 13:06:12

Adam Davis's advice is solid.

I would additionally suggest that you try to "reverse-engineer" what the JavaScript is doing, and instead of trying to scrape the page, you issue the HTTP requests that the JavaScript is issuing and interpret the results yourself (most likely in JSON format, nice and easy to parse). This strategy could be anything from trivial to a total nightmare, depending on the complexity of the JavaScript.
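If the requests do come back as JSON, interpreting them is the easy part; the payload below is invented, since the real field names would come from inspecting the actual responses:

```python
import json

# An invented example of the kind of JSON an election-results XHR
# might return; real field names come from inspecting real responses.
payload = '{"state": "IA", "results": [{"name": "Candidate A", "votes": 940}]}'

data = json.loads(payload)
for result in data["results"]:
    print(result["name"], result["votes"])
```

Compare this with parsing the rendered HTML: the JSON already has the structure you want, which is why reverse-engineering the XHRs pays off when the JavaScript is simple enough.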

The best possibility, of course, would be to convince the website's maintainers to implement a developer-friendly API. All the cool kids are doing it these days 8-) Of course, they might not want their data scraped in an automated fashion... in which case you can expect a cat-and-mouse game of making their page increasingly difficult to scrape :-(

剩余の解释 2024-07-14 13:06:12

There is a bit of a learning curve, but tools like Pamie (Python) or Watir (Ruby) will let you latch onto the IE web browser and get at the elements. This turns out to be easier than Mechanize and other HTTP-level tools, since you don't have to emulate the browser; you just ask the browser for the HTML elements. It's also going to be way easier than reverse-engineering the JavaScript/Ajax calls. If needed, you can also use tools like Beautiful Soup in conjunction with Pamie.

绝對不後悔。 2024-07-14 13:06:12

Probably the easiest way is to use the IE WebBrowser control from C# (or any other language). You get access to everything inside the browser out of the box, and you don't need to care about cookies, SSL, and so on.

春庭雪 2024-07-14 13:06:12

I found the IE WebBrowser control has all kinds of quirks and needed workarounds, enough to justify some high-quality software that takes care of all those inconsistencies, layers around the shdocvw.dll API and MSHTML, and provides a framework.

妄断弥空 2024-07-14 13:06:12

This seems like a pretty common problem. I wonder why no one has developed a programmatic browser? I'm envisioning a Firefox you can call from the command line with a URL as an argument: it would load the page, run all of the initial page-load JS events, and save the resulting file.

I mean, Firefox and other browsers already do this; why can't we simply strip off the UI stuff?
