Would providing an API help deter screen scraping?
I have been thinking quite a bit here lately about screen scraping and what a task it can be. So I pose the following question.
Would you, as a site developer, expose simple APIs, such as JSON results, to deter users from screen scraping?
These results could then be cached, and they generate far less traffic than the large amounts of markup that would otherwise be downloaded.
I am not looking at prevention, but at deterring scraping.
Scraping Bandwidth Sample
((users * (% / 100)) * ((freq * 60) * 24)) * filesize
- users: 200,000
- % of users using utility: 5
- filesize: 1kb
- freq: 1 minute
Formula:
((users * (% / 100)) * ((freq * 60) * 24)) * filesize
10,000 * 1440 * 1
14,400,000 KB, or about 13.73 GB
Assuming your JSON result is 200 bytes (0.2 KB), that's now (10,000 * 1440 * 0.2) = 2,880,000 KB, or about 2.75 GB a day.
That's a difference of about 11 GB of traffic a day.
My Stack Overflow profile page is 96 KB, for reference.
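The arithmetic above can be sketched as a small script, using the hypothetical numbers from the example:

```python
# Hypothetical figures from the example above: 200,000 users, 5% of them
# running a scraping utility that polls once per minute, all day.
users = 200_000
pct_using = 5                                # % of users using utility
freq_per_min = 1                             # requests per minute
requests_per_day = freq_per_min * 60 * 24    # 1,440 requests/day

def daily_traffic_gb(filesize_kb: float) -> float:
    """Total daily transfer in GB across the whole scraping population."""
    scrapers = users * pct_using / 100       # 10,000 scrapers
    total_kb = scrapers * requests_per_day * filesize_kb
    return total_kb / 1024 / 1024            # KB -> GB

print(daily_traffic_gb(1.0))   # ~13.73 GB/day for 1 KB of markup
print(daily_traffic_gb(0.2))   # ~2.75 GB/day for a 200-byte JSON result
```

Swapping in your own user count and file sizes shows how quickly the markup-vs-JSON gap grows.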
This question was prompted by a UserVoice request for a JSON result for user profiles:
http://stackoverflow.uservoice.com/pages/general/suggestions/101342-add-json-for-user-information
I wanted to find out if other developers would expose this type of API, and if it is worth your time to provide these APIs to reduce bandwidth.
Screen scraping is not realistically preventable. Providing an API, while nice to those who consume your data, can't prevent it. Since the data ultimately has to be human readable, it therefore is machine readable. You would be better off spending your energy working on your site and not working for those who would consume your data (legally or not).
wget, perl, and regular expressions are the common mechanisms for scraping data.
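To illustrate the point that human-readable means machine-readable: a one-line regex is all it takes to pull a value out of rendered markup. The markup pattern below is made up for the example; the fragility is the point.

```python
import re
import urllib.request

# Assumes the profile page marks reputation up like:
#   <span class="reputation">1,234</span>
# (a hypothetical pattern - any real site will differ)
REPUTATION_RE = re.compile(r'class="reputation">([\d,]+)<')

def parse_reputation(html: str) -> int:
    """Extract the reputation number from profile markup."""
    match = REPUTATION_RE.search(html)
    if match is None:
        raise ValueError("markup changed - exactly why scraping is fragile")
    return int(match.group(1).replace(",", ""))

def scrape_reputation(url: str) -> int:
    html = urllib.request.urlopen(url).read().decode("utf-8")
    return parse_reputation(html)

print(parse_reputation('<span class="reputation">1,234</span>'))  # 1234
```

Any redesign of the page silently breaks this, which is why scrapers keep coming back no matter what you do to the markup.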
If you want to encourage people to integrate with your site, or it is popular enough for this to be a problem (so that you are forced to allow people to integrate with it), then by all means provide an API. If your API is adequate and easy to use, then people will prefer it to screen scraping. If your API is inadequate or harder to use than screen scraping, then you may still have the problem.
If it's easier for technical users to use an API than it is for them to screen-scrape, they will do so. Better still, if you can encourage people to use your APIs instead of screen-scraping, you should have a much easier time monitoring traffic, because the automated user-agents are clearly distinguished from the browser user-agents.
A RESTful JSON interface is a good choice, because it can be scripted from any other language fairly easily (show me a language that doesn't have a JSON parser and I'll show you a language nobody cares about).
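A minimal sketch of consuming such an endpoint; the URL and field names here are hypothetical, since the real ones depend on the site:

```python
import json
import urllib.request

def parse_profile(payload: str) -> dict:
    """The entire 'parser' an API consumer needs - contrast with a scraper."""
    return json.loads(payload)

def fetch_profile(user_id: int) -> dict:
    # Hypothetical endpoint for illustration only.
    url = f"https://example.com/api/users/{user_id}.json"
    with urllib.request.urlopen(url) as resp:
        return parse_profile(resp.read().decode("utf-8"))

# A 200-byte JSON document can carry the same fields as 96 KB of markup:
sample = '{"id": 42, "name": "alice", "reputation": 1234}'
print(parse_profile(sample)["reputation"])  # 1234
```

Because the structure is a stable contract rather than presentation markup, consumers don't break when the page design changes.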
Most developers will choose the technology to use for their own reasons. So if you provide an API that is easier than what they use to scrape your screens, then some unknown percentage will move to it. Bandwidth reduction will probably be very low on their list of considerations.
Since you haven't specified what's being scraped for, we can't help you guess what kind of API to provide, or what proportion will use it.
One very common tool for scraping that's hard to deflect is using Excel or some other product that makes the scrape painless.
If your intent is just to minimize the pain (which one might infer from your question), then by far the most useful thing to do is to survey the scrapers themselves - more useful than asking SO, anyway.
You might check woot.com and see what they provide on an RSS feed to unburden the web http server.
If you want to provide an open model that people can develop solutions on top of your site, then yes, you should provide an API. Screen scraping is a method of hostile integration, and should only be used as a last resort.
Providing an API should definitely reduce the amount of screen scraping that gets done against your site. Using a good REST API is much easier and safer than screen scraping. Screens can change without notice, and that makes screen scraping code much harder to maintain. As a developer, if I need information from a site, I'd never scrape the site if the same information was available through an API.