Ultra-fast screen scraping techniques?

Posted on 2024-07-14 01:17:48


Comments (9)

过期情话 2024-07-21 01:17:48

HtmlUnit is a scriptable, headless browser written in Java. We use it on some extremely error-ridden, complex web pages, and it usually does a very good job.

To simplify things even more, you can run it from Jython. The resulting program reads more like a transcript of how one might use a browser than like hard work.

青芜 2024-07-21 01:17:48

You don't mention what you want to use this for. If having a web browser repeat your actions is an acceptable solution, one option is simply to "script" your browser with a tool like Selenium. You can use the Selenium IDE to record what you do and then alter the parameters.

"I wish I could just 'record my session' quickly and then parametrize the things that vary from session to session."

If you have Visual Studio test edition, its web test function does exactly that. If you aren't using VS or want a standalone tool, I have had great success with OpenSpan. It covers more than just the web: it also handles Windows apps and Java.

度的依靠╰つ 2024-07-21 01:17:48

Selenium would be my first pick, as the IDE lets you do a lot of things the easy way by "recording" a session for you. But if you're not happy with what it provides, you can also use the Python module called Beautiful Soup to walk through a website programmatically.
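To give a rough idea of what "walking through a website programmatically" involves, here is a minimal sketch that collects the links on a page so a script can decide where to go next. It deliberately uses only Python's standard library (html.parser) as a stand-in, so it is not Beautiful Soup's API; the names LinkCollector and extract_links and the sample URLs are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the absolute URL of every <a href="..."> on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the URL of the page itself.
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return all link targets found in `html`, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links


if __name__ == "__main__":
    # Hypothetical page content and URL, for illustration only.
    sample = '<p><a href="/next?page=2">Next</a> <a href="http://example.org/">Out</a></p>'
    for link in extract_links(sample, "http://example.com/list"):
        print(link)
```

A real crawler would fetch each returned URL, feed the response back into extract_links, and keep a set of visited URLs to avoid loops.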

我一直都在从未离去 2024-07-21 01:17:48

Python and Perl both have a module called Mechanize (WWW::Mechanize for Perl) that makes it easy to reproduce browser behavior programmatically (filling out forms, handling cookies, etc.).

So: Python + BeautifulSoup (a great HTML/XML parser) + mechanize (browser functions) = a super easy, fast scraper.
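As a rough illustration of the two browser behaviors mentioned above (cookies and form submission), here is a sketch built only on Python's standard library. It is not the mechanize API; the helper names make_browser and form_request, the user agent string, and the login URL and field names are all placeholders.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_browser():
    """Return (opener, jar): an opener that stores cookies and resends them."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", "example-scraper/0.1")]
    return opener, jar


def form_request(action_url, fields):
    """Build the POST request a browser would send for a simple HTML form."""
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(
        action_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


if __name__ == "__main__":
    opener, jar = make_browser()
    # Hypothetical login form; the URL and field names are placeholders.
    request = form_request(
        "http://example.com/login", {"user": "alice", "password": "secret"}
    )
    print(request.get_method(), request.full_url, request.data)
    # opener.open(request) would actually send it. Cookies set by the reply
    # end up in `jar` and are sent again on later opener.open(...) calls.
```

The response body from opener.open(...) can then be handed to whichever HTML parser you prefer. Libraries like mechanize add conveniences on top of this, such as locating forms in the page for you.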

小镇女孩 2024-07-21 01:17:48

I used DomInspector to inspect the site of interest manually and parametrize its structure. Then I used a simple Apache HttpClient and a hand-made parser driven by that parametrized structure. Basically, I could extract any information from any site automatically with a little tweaking of the parameters.

It's similar to how a SAX parser works: all you need to tell it is the sequence of tags at which you want to start grabbing data. For example, Google has a pretty standard format for its search results, so you just run to the third occurrence of 'tab' and start collecting text from the first 'div' up until the closing '/div'.
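The tag-sequence idea can be sketched with an event-driven parser. The version below uses Python's html.parser rather than Java and Apache HttpClient, so it is an illustration of the approach and not the author's code. The class name, the parameter names, and the choice to stop at the matching closing tag (counting nested tags) are assumptions.

```python
from html.parser import HTMLParser


class TagSequenceExtractor(HTMLParser):
    """Skip to the n-th <anchor_tag>, then collect the text inside the
    next <capture_tag> element, up to its matching closing tag."""

    def __init__(self, anchor_tag, anchor_count, capture_tag):
        super().__init__()
        self.anchor_tag = anchor_tag
        self.anchor_count = anchor_count
        self.capture_tag = capture_tag
        self.seen = 0      # anchor tags encountered so far
        self.depth = 0     # nesting depth inside the captured element
        self.done = False  # True once the captured element has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            # Already capturing: track nested capture tags.
            if tag == self.capture_tag:
                self.depth += 1
        elif self.seen >= self.anchor_count and tag == self.capture_tag:
            self.depth = 1
        elif tag == self.anchor_tag:
            self.seen += 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.capture_tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract(html, anchor_tag, anchor_count, capture_tag):
    """Return the captured text with runs of whitespace collapsed."""
    parser = TagSequenceExtractor(anchor_tag, anchor_count, capture_tag)
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


if __name__ == "__main__":
    # Made-up markup; real pages need their own tag names and counts.
    sample = (
        "<table></table><table></table>"
        "<table><div>first <b>result</b></div><div>second</div></table>"
    )
    print(extract(sample, "table", 3, "div"))
```

Only the three parameters change from site to site, which is what makes the approach quick to adapt. It is also brittle: any change to the page layout shifts the tag counts.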

南汐寒笙箫 2024-07-21 01:17:48

Internet Explorer supports Browser Helper Objects (BHOs). They can access IE's HWND (window handle), and it's easy to scrape the pixels from there. The IWebBrowser2 COM interface also gives you access to the HTTP requests, and you can get back the parsed HTML document via IWebBrowser2::Document (IHTMLDocument / IHTMLDocument2 / IHTMLDocument3).

漫漫岁月 2024-07-21 01:17:48

With Firefox, it should be possible to implement much of this through its powerful support for add-ons and extensions. That wouldn't really be running "headless", though; it would be a real, scripted browser. Also, I seem to recall reading that Google's Chrome browser uses a similar technique for automated regression testing.

赢得她心 2024-07-21 01:17:48

I can't personally vouch for it, but there is a free Firefox plugin: DejaClick.
I installed it the other day and did some basic recording, playback, and script editing with it. It handled all of them without much of a learning curve. If your end goal is to show something in a web browser, it should suffice.

They offer web transaction monitoring services, which implies that you can export the scripts for other uses, but the scripts may be too proprietary to use outside of your web browser or their paid service.

http://www.dejaclick.com/
