Best screen-scraping tool: Simple HTML DOM or Snoopy?

Posted 2024-08-12 09:46:25 · 197 words · 8 views · 0 comments

Which one is better for screen scraping: Simple HTML DOM or Snoopy?
I use Simple HTML DOM and find it comfortable.
Does Snoopy have any advantage over Simple HTML DOM?

My requirements: I want to scrape content from a page (after logging in).
Simple HTML DOM is easy to use, but it takes a long time to print the results.


Comments (2)

孤蝉 2024-08-19 09:46:25

Is Snoopy a well-known, mature package?

If it's not, then all other things being equal, I'd probably go with generic HTML DOM code - especially if the scraping is somewhat simple.

But only you know when your code is starting to get too big, unmanageable, etc., at which point it might be better to look at another tool out there like Snoopy.

(Which, admittedly, I don't have experience with; it's apparently at http://sourceforge.net/projects/snoopy/ for those not familiar with it - "Snoopy is a PHP class that simulates a web browser. It automates the task of retrieving web page content and posting forms, for example.")
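To make that quoted description a bit more concrete, here is a minimal sketch of the same idea (a cookie-aware client that posts a login form and then fetches a page). It is written in Python with only the standard library, so it is not Snoopy and not PHP. The helper names and the `username`/`password` field names are hypothetical; a real site will use its own form fields and may require extra tokens.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Return an opener that keeps cookies between requests, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(login_url, fields):
    """Build a POST request carrying URL-encoded form fields (field names are assumptions)."""
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(login_url, data=body, method="POST")


def fetch_after_login(opener, login_request, page_url, timeout=30):
    """Submit the login form, then fetch page_url with whatever cookies the login set."""
    with opener.open(login_request, timeout=timeout):
        pass  # the login response body is not needed, only its cookies
    with opener.open(page_url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Usage would look roughly like `opener, _ = make_session()` followed by `fetch_after_login(opener, build_login_request(login_url, {...}), page_url)`, with both URLs supplied by you.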

The real reason I'm posting, even though I don't know Snoopy per se and thus can't definitively answer your question, is to ask if you've considered using Selenium (http://www.seleniumhq.org/) instead of Snoopy.

Selenium is a fairly well-known testing tool, and it occurred to me that one of the nice things about using it for what you're doing (if you can) is that testing is built in.

The reason that's good is that screen scraping is kind of an inherently brittle task - if the target site changes something, blam, your scraping fails. So it's kind of a nice design to have an automated scrape/test-that-scraping-worked system.
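As a rough illustration of that scrape-then-verify idea, here is a small Python sketch using only the standard library. It is not Selenium's API; it is just a plain sanity check that fails loudly when the scraped result looks wrong. The class name `title` and the helpers `scrape_titles` and `check_scrape` are made up for the example.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassTextCollector(HTMLParser):
    """Collect the text of every element carrying a given CSS class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0      # greater than 0 while inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1
        elif self.wanted_class in classes:
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


def scrape_titles(html, wanted_class="title"):
    """Return the stripped text of each element with the wanted class."""
    parser = ClassTextCollector(wanted_class)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.results]


def check_scrape(items, minimum=1):
    """Raise ValueError when the result suggests the page layout changed."""
    if len(items) < minimum:
        raise ValueError(f"expected at least {minimum} items, got {len(items)}")
    if any(not item for item in items):
        raise ValueError("found an empty item; the markup may have changed")
    return items
```

Running `check_scrape(scrape_titles(page_html))` after every fetch means a redesign of the target site shows up as an error right away instead of as silently empty data.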

Something to think about, anyway.

杀手六號 2024-08-19 09:46:25

I've stumbled into BeautifulSoup, which is Python-based. I suppose there are a bunch of others too.

Looks like Snoopy is PHP-based, and hence can be run server-side only. Is this what you are really looking for? What are your requirements? Please elaborate on that.
