How do screen scrapers work?

Posted on 2024-07-06 15:10:45

Comments (9)

记忆之渊 2024-07-13 15:10:45

Technically, a screen scraper is any program that grabs the display data of another program and ingests it for its own use.

Quite often, screen scraping refers to a web client that parses the HTML pages of a targeted website to extract formatted data. This is done when a website does not offer an RSS feed or a REST API for accessing the data programmatically.

One example of a library used for this purpose is Hpricot for Ruby, which is one of the better-architected HTML parsers used for screen scraping.
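As a rough illustration of "a web client that parses HTML to extract formatted data", here is a minimal sketch in Python using only the standard library's html.parser. The page fragment, the LinkExtractor class and the extract_links helper are made up for this example; a real scraper would download the page over HTTP first.

```python
from html.parser import HTMLParser

# Hypothetical page fragment standing in for a downloaded page.
PAGE = """
<html><body>
  <h1>Latest posts</h1>
  <ul>
    <li><a href="/posts/1">First post</a></li>
    <li><a href="/posts/2">Second post</a></li>
  </ul>
</body></html>
"""


class LinkExtractor(HTMLParser):
    """Collects (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> we are currently inside, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Parse an HTML string and return the links found in it."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


print(extract_links(PAGE))
```

The result is a plain list of tuples, which is the "formatted data" the rest of a program can work with.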

拥有 2024-07-13 15:10:45

Lots of accurate answers here.

What nobody's said is: don't do it!

Screen scraping is what you do when nobody's provided you with a reasonable machine-readable interface. It's hard to write, and brittle.

As an example, consider an RSS aggregator, then consider code that gets the same information by working through a normal human-oriented blog interface. Which one breaks when the blogger decides to change their layout?

Of course, sometimes you have no choice :(
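To make the brittleness point concrete, here is a small Python sketch. The feed, both page layouts and the two helper functions are invented for illustration. The feed reader relies on the RSS structure, while the scraper relies on the markup of one particular layout.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical RSS feed for a blog.
FEED = """<rss version="2.0"><channel>
  <title>Example blog</title>
  <item><title>Hello world</title><link>http://example.com/1</link></item>
  <item><title>Second post</title><link>http://example.com/2</link></item>
</channel></rss>"""

# The same posts as rendered by a hypothetical old and new page layout.
OLD_LAYOUT = (
    '<div class="post"><h2>Hello world</h2></div>'
    '<div class="post"><h2>Second post</h2></div>'
)
NEW_LAYOUT = (
    '<article><h3 class="headline">Hello world</h3></article>'
    '<article><h3 class="headline">Second post</h3></article>'
)


def titles_from_feed(feed_xml):
    """Read post titles from the machine-readable feed."""
    root = ET.fromstring(feed_xml)
    return [item.findtext("title") for item in root.iter("item")]


def titles_from_page(html):
    """Scrape post titles; tied to the markup of the old layout."""
    return re.findall(r'<div class="post"><h2>(.*?)</h2>', html)


print(titles_from_feed(FEED))        # unaffected by any redesign of the page
print(titles_from_page(OLD_LAYOUT))  # works while the layout stays the same
print(titles_from_page(NEW_LAYOUT))  # finds nothing after the redesign
```

Note that the scraper does not crash on the new layout, it just quietly returns an empty list, which is often the harder failure to notice.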

忘你却要生生世世 2024-07-13 15:10:45

In general, a screen scraper is a program that captures output from a server program by mimicking the actions of a person sitting in front of a workstation using a browser or terminal access program. At certain key points the program interprets the output and then takes an action or extracts certain pieces of information from it.

Originally this was done with character/terminal output from mainframes, to extract data from or update systems that were archaic or not directly accessible to the end user. In modern terms it usually means parsing the response to an HTTP request to extract data or to take some other action. With the advent of web services this sort of thing should have died away, but not all apps provide a nice API to interact with.
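For the character/terminal case, scraping usually comes down to reading fields at known screen positions. The following Python sketch assumes a terminal emulator has already handed over the screen as a list of text rows; the screen contents, the FIELDS layout and scrape_account are all hypothetical.

```python
from decimal import Decimal

# Hypothetical terminal screen, one string per row.
SCREEN = [
    "ACCOUNT INQUIRY                 PAGE 1",
    "ACCT NO: 00012345   STATUS: ACTIVE    ",
    "NAME   : JANE DOE                     ",
    "BALANCE:    1,250.75                  ",
]

# Assumed field layout: name -> (row, start column, end column).
FIELDS = {
    "account": (1, 9, 17),
    "status": (1, 28, 38),
    "name": (2, 9, 38),
    "balance": (3, 8, 38),
}


def scrape_account(screen):
    """Check that we are on the expected screen, then cut out each field."""
    # Interpret the output first: refuse to read fields from the wrong screen.
    if not screen[0].startswith("ACCOUNT INQUIRY"):
        raise ValueError("unexpected screen: " + screen[0].strip())
    record = {
        name: screen[row][start:end].strip()
        for name, (row, start, end) in FIELDS.items()
    }
    # Convert the displayed amount into a number the caller can compute with.
    record["balance"] = Decimal(record["balance"].replace(",", ""))
    return record


print(scrape_account(SCREEN))
```

The check on the first row is the "interpret the output" step: a real scraper would use it to decide whether to send another keystroke, retry, or give up.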

葬心 2024-07-13 15:10:45

A screen scraper downloads the HTML page and pulls out the data of interest, either by searching for known tokens or by parsing it as XML or some such.
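The "searching for known tokens" variant can be as small as the Python sketch below. The between helper and the sample page are made up for this example; no HTML parsing is involved, only plain string searches.

```python
def between(text, start_token, end_token):
    """Return the text between the first start_token and the next end_token.

    Returns None if either token is missing.
    """
    start = text.find(start_token)
    if start == -1:
        return None
    start += len(start_token)
    end = text.find(end_token, start)
    if end == -1:
        return None
    return text[start:end].strip()


# Hypothetical downloaded page.
PAGE = (
    '<html><body><span id="price"> 19.99 </span>'
    '<span id="stock">In stock</span></body></html>'
)

print(between(PAGE, '<span id="price">', "</span>"))
print(between(PAGE, '<span id="stock">', "</span>"))
```

Returning None when a token is missing gives the caller a chance to notice that the page has changed.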

心病无药医 2024-07-13 15:10:45

In the early days of PCs, screen scrapers would emulate a terminal (e.g. IBM 3270) and pretend to be a user in order to interactively extract and update information on the mainframe. In more recent times, the concept is applied to any application that provides an interface via web pages.

With the emergence of SOA, screen scraping is a convenient way to service-enable applications that aren't. In those cases, web page scraping is the more common approach.

与往事干杯 2024-07-13 15:10:45

Here's a tiny bit of screen scraping implemented in JavaScript, using jQuery (not a common choice, mind you, since scraping is usually a client-server activity):

// Show my SO reputation score
var repval = $('span.reputation-score:first');
alert('StackOverflow User "' + repval.prev().attr('href').split('/').pop() +
      '" has (' + repval.html() + ') Reputation Points.');

If you run Firebug, copy the above code, paste it into the Console and see it in action right here on this question page.

If SO changes the DOM structure / element class names / URI path conventions, all bets are off and it may not work any longer. That's the usual risk in screen scraping endeavors, where there is no contract/understanding between the parties (the scraper and the scrapee [yes, I just invented a word]).

指尖凝香 2024-07-13 15:10:45

Technically, a screen scraper is any program that grabs the display data of another program and ingests it for its own use. In the early days of PCs, screen scrapers would emulate a terminal (e.g. IBM 3270) and pretend to be a user in order to interactively extract and update information on the mainframe. In more recent times, the concept is applied to any application that provides an interface via web pages.

With the emergence of SOA, screen scraping is a convenient way to service-enable applications that aren't. In those cases, web page scraping is the more common approach.

Quite often, screen scraping refers to a web client that parses the HTML pages of a targeted website to extract formatted data. This is done when a website does not offer an RSS feed or a REST API for accessing the data programmatically.

Typically, you have an HTML page that contains some data you want. You write a program that fetches that web page and attempts to extract that data. This can be done with XML parsers, but for simple applications I prefer to use regular expressions to match a specific spot in the HTML and extract the necessary data. Sometimes it can be tricky to create a good regular expression, though, because the surrounding HTML appears multiple times in the document. You always want to match a unique item as close as you can to the data you need.

Screen scraping is what you do when nobody's provided you with a reasonable machine-readable interface. It's hard to write, and brittle.

As an example, consider an RSS aggregator, then consider code that gets the same information by working through a normal human-oriented blog interface. Which one breaks when the blogger decides to change their layout?

One example of a library used for this purpose is Hpricot for Ruby, which is one of the better-architected HTML parsers used for screen scraping.

手心的海 2024-07-13 15:10:45

You have an HTML page that contains some data you want. You write a program that fetches that web page and attempts to extract that data. This can be done with XML parsers, but for simple applications I prefer to use regular expressions to match a specific spot in the HTML and extract the necessary data. Sometimes it can be tricky to create a good regular expression, though, because the surrounding HTML appears multiple times in the document. You always want to match a unique item as close as you can to the data you need.
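The advice about matching a unique item close to the data can be seen in a short Python sketch. The page fragment and both patterns are invented for illustration: the first pattern matches markup that repeats in every row, the second is anchored on the label text right next to the wanted value.

```python
import re

# Hypothetical page fragment: the same cell markup repeats in every row.
PAGE = """
<table>
  <tr><td class="label">Views</td><td class="value">1,024</td></tr>
  <tr><td class="label">Answers</td><td class="value">9</td></tr>
</table>
"""

# Too generic: <td class="value"> appears in every row, so the first row wins.
generic = re.search(r'<td class="value">([^<]*)</td>', PAGE)

# Anchored on the unique label directly before the value we want.
anchored = re.search(r'Answers</td>\s*<td class="value">([^<]*)</td>', PAGE)

print(generic.group(1))   # value from the wrong row
print(anchored.group(1))  # value from the "Answers" row
```

Keeping the anchor short and close to the data means fewer unrelated markup changes can break the pattern.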

泛滥成性 2024-07-13 15:10:45

"Screen scraping is what you do when nobody's provided you with a reasonable machine-readable interface. It's hard to write, and brittle."

Not quite true. I don't think I'm exaggerating when I say that most developers do not have enough experience to write decent APIs. I've worked with screen scraping companies, and often the APIs are so problematic (ranging from cryptic errors to bad results), and so often don't give the full functionality that the website provides, that it can be better to screen scrape (web scrape, if you will). The extranet/website portals are used by more customers/brokers than the API is, and thus are better supported. In big companies, changes to extranet portals etc. are infrequent, usually because the portal was originally outsourced and now it's just maintained. I refer more to screen scraping where the output is tailored, e.g. a flight on a particular route and time, an insurance quote, a shipping quote, etc.

In terms of doing it, it can be as simple as using a web client to pull the page contents into a string and then using a series of regular expressions to extract the information you want.

// WebClient (System.Net) takes the URL in DownloadString, not in the constructor.
string pageContents = new WebClient().DownloadString("https://www.stackoverflow.com");
int numberOfPosts = // regex match

Obviously, in a large-scale environment you'd be writing more robust code than the above.

"A screen scraper downloads the html page, and pulls out the data interested either by searching for known tokens or parsing it as XML or some such."

That is a cleaner approach than regex... in theory. In practice, however, it's not quite as easy, given that most documents need to be normalized to XHTML before you can XPath through them. In the end we found that fine-tuned regular expressions were more practical.
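The normalization problem is easy to reproduce. The Python sketch below uses the limited XPath support in the standard library's xml.etree.ElementTree; both documents and the quotes helper are hypothetical, and the well-formed sample deliberately has no XML namespace, which real XHTML would normally carry.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed page: an XML parser and XPath work directly.
XHTML = """<html><body>
  <div id="quotes">
    <p class="quote">Route A: 120.00</p>
    <p class="quote">Route B: 95.50</p>
  </div>
</body></html>"""

# Hypothetical real-world markup: unquoted attribute and unclosed tags.
TAG_SOUP = "<html><body><p class=quote>Route A: 120.00<br><p>unclosed</body></html>"


def quotes(xhtml):
    """Select the quote paragraphs with an XPath expression."""
    root = ET.fromstring(xhtml)
    return [p.text for p in root.findall(".//div[@id='quotes']/p[@class='quote']")]


print(quotes(XHTML))

try:
    quotes(TAG_SOUP)
except ET.ParseError as err:
    # The XML parser rejects the page before any XPath can run.
    print("not well-formed:", err)
```

This is the step where a normalizer would have to come in before XPath is usable, which is the extra work the answer above is weighing against a fine-tuned regular expression.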
