Looking for examples of when screen scraping might be worthwhile
Screen scraping seems like a useful tool - you can go onto someone else's site and steal their data - how wonderful!
But I'm having a hard time seeing how useful this could be.
Most application data is pretty specific to that application even on the web. For example, let's say I scrape all of the questions and answers off of StackOverflow or all of the results off of Google (assuming this were possible) - I'm left with data that is not very useful unless I either have a competing question and answer site (in which case the stolen data will be immediately obvious) or a competing search engine (in which case, unless I have an algorithm of my own, my data is going to be stale pretty quickly).
So my question is, under what circumstances could the data from one app be useful to some external app? I'm looking for a practical example to illustrate the point.
10 Answers
It's useful when a site publicly provides data that is (still) not available as an XML service. I had a client who used scraping to pull flight tracking data into one of his company's intranet applications.
The technique is also used for research. I had a client who wanted to compare the contents of several online dictionaries by part of speech, and all of these sites had to be scraped.
It is not a technique for "stealing" data. All ordinary usage restrictions apply. Many sites implement CAPTCHA mechanisms to prevent scraping, and it is inappropriate to work around these.
A good example is StackOverflow - no need to scrape data as they've released it under a CC license. Already the community is crunching statistics and creating interesting graphs.
There's a whole bunch of popular mashup examples on ProgrammableWeb. You can even meet up with fellow mashupers (O_o) at events like BarCamps and Hack Days (take a sleeping bag). Have a look at the wealth of information available from Yahoo APIs (particularly Pipes) and see what developers are doing with it.
Don't steal and republish, build something even better with the data - new ways of understanding, searching or exploring it. Always cite your data sources and thank those who helped you. Use it to learn a new language or understand data or help promote the semantic web. Remember it's for fun not profit!
Hope that helps :)
If the site has data that would benefit from being accessible through an API (and it would be free and legal to do so), but they just haven't implemented one yet, screen scraping is a way of essentially creating that functionality for yourself.
Practical example -- screen scraping would allow you to create some sort of mashup that combines information from the entire SO family of sites, since there's currently no API.
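To make the "create the API yourself" idea concrete, here is a minimal sketch in Python using only the standard library's `html.parser`. The markup and the `question-title` class name are made up stand-ins for a page that offers no API; a real site's structure would differ.

```python
from html.parser import HTMLParser

class QuestionTitleParser(HTMLParser):
    """Collect the text of links marked with a (hypothetical) question-title class."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "a" and ("class", "question-title") in attrs:
            self.in_title = True

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_title = False

# Made-up HTML standing in for a fetched page.
html = """
<ul>
  <li><a class="question-title" href="/q/1">When is scraping worthwhile?</a></li>
  <li><a class="question-title" href="/q/2">Parsing HTML in Python</a></li>
</ul>
"""

parser = QuestionTitleParser()
parser.feed(html)
print(parser.titles)
```

In practice the `html` string would come from an HTTP fetch, and this extracted list becomes the "API response" the site never provided.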
Well, to collect data from a mainframe. That's one reason why some people use screen scraping. Mainframes are still in use in the financial world, often running software written in the previous century. The people who wrote it might already be retired, and since this software is very critical for these organizations, they really hate it when some new code needs to be added. So screen scraping offers an easy interface to communicate with the mainframe, collect information from it, and send it onwards to any process that needs this information.
Rewrite the mainframe application, you say? Well, software on mainframes can be very old. I've seen software on mainframes that was over 30 years old, written in COBOL. Often, those applications work just fine and companies don't want to risk rewriting parts because it might break some code that had been working for over 30 years! Don't fix things if they're not broken, please. Of course, additional code could be written but it takes a long time for mainframe code to be used in a production environment. And experienced mainframe developers are hard to find.
I myself had to use screen scraping in a software project. It was a scheduling application which had to capture the console output of every child process it started. That's the simplest form of screen scraping, actually, and many people don't even realize that redirecting the output of one application into the input of another is still a kind of screen scraping. :)
Basically, screen scraping allows you to connect one (web) application with another one. It's often a quick solution, used when other solutions would cost too much time. Everyone hates it, but the amount of time it saves still makes it very efficient.
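The console-capture case above can be sketched in a few lines of Python; here a throwaway one-liner stands in for a real child job the scheduler would launch.

```python
import subprocess
import sys

# Start a child process and capture everything it writes to the console,
# the way the scheduling application described above did. The child here
# is a placeholder command, not a real job.
result = subprocess.run(
    [sys.executable, "-c", "print('job finished: 3 files processed')"],
    capture_output=True,
    text=True,
)

# The captured text can then be parsed and forwarded to whatever
# process needs the information.
captured = result.stdout.strip()
print(captured)
```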
Any time you need a computer to read the data on a website. Screen scraping is useful in exactly the same instances that any website API is useful. Some websites, however, don't have the resources to create an API themselves; screen scraping is the developer's way around that.
For instance, in the earlier days of Stack Overflow, someone built a tool to track changes to your reputation over time, before Stack Overflow itself provided that feature. Since Stack Overflow had no API at the time, the only way to do that was to screen scrape.
Let's say you wanted to get scores from a popular sports site that did not make the information available via an XML feed or API.
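A quick-and-dirty version of that scores scrape might look like the following sketch; the markup and class names are invented, and the regex approach is exactly as fragile as the answers here warn.

```python
import re

# Hypothetical markup from a sports site that exposes scores only as HTML.
page = """
<div class="score"><span class="team">Reds</span> <span class="pts">3</span></div>
<div class="score"><span class="team">Blues</span> <span class="pts">1</span></div>
"""

# With no feed or API, pull team/score pairs straight out of the markup.
scores = re.findall(
    r'<span class="team">(\w+)</span> <span class="pts">(\d+)</span>', page
)
print({team: int(pts) for team, pts in scores})
```

Any change to the site's markup breaks the pattern, which is the usual trade-off with this technique.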
For one project we found a (cheap) commercial vendor that offered translation services for a specific file format. The vendor didn't offer an API (it was, after all, a cheap vendor) and instead had a web form to upload and download from.
With hundreds of files a day, the only way to do this was to use WWW::Mechanize in Perl, screen-scrape its way through the login form and upload boxes, submit the file, and save the returned file. It's ugly and definitely fragile (if the vendor changes the site in the least, it could break the app), but it works. It's been working now for over a year.
One example from my experience.
I needed a list of major cities throughout the world with their latitude and longitude for an iPhone app I was building. The app would use that data along with the geolocation feature on the iPhone to show which major city each user of the app was closest to (so as not to show exact location), and plot them on a 3D globe of the earth.
I couldn't easily find an appropriate list anywhere in an XML/Excel/CSV-type format, but I did find a Wikipedia page with (roughly) the info I needed. So I wrote up a quick script to scrape that page and load the data into a database.
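A "quick script" in that spirit could parse the page's table rows with the standard library and dump them to CSV for loading into a database. The table fragment below is a made-up stand-in for the Wikipedia page, not its actual markup.

```python
import csv
import io
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect the cell text of each table row."""

    def __init__(self):
        super().__init__()
        self.rows, self.row, self.in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True

    def handle_data(self, data):
        if self.in_cell:
            self.row.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False
        elif tag == "tr" and self.row:
            self.rows.append(self.row)
            self.row = []

# Invented fragment shaped like a city/latitude/longitude table.
html = """<table>
<tr><td>Tokyo</td><td>35.68</td><td>139.69</td></tr>
<tr><td>Lagos</td><td>6.52</td><td>3.38</td></tr>
</table>"""

parser = TableParser()
parser.feed(html)

out = io.StringIO()
csv.writer(out).writerows(parser.rows)
print(out.getvalue())
```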
The obvious case is when a web service doesn't offer reverse search. You can implement that reverse search over the same data set, but it requires scraping the entire dataset.
This may be fair use if the reverse search also requires significant pre-processing, e.g. because you need to support partial matching. The data source may not have the technical skills or computing resources to provide the reverse search option.
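The reverse-search idea can be sketched as a small pre-processing step over an already-scraped dataset; the records here are invented, and the "partial matching" is a simple substring check standing in for heavier pre-processing.

```python
# Forward direction the original service offers: email -> name.
records = {
    "alice@example.com": "Alice Smith",
    "bob@example.com": "Bob Jones",
}

# Pre-processing step: invert the mapping so lookups go name -> email.
reverse = {name.lower(): email for email, name in records.items()}

def reverse_search(fragment):
    """Return emails whose owner's name contains the fragment."""
    fragment = fragment.lower()
    return [email for name, email in reverse.items() if fragment in name]

print(reverse_search("smith"))
```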
I use screen scraping daily. I run some eCommerce sites and have screen-scraping scripts running every day to gather product lists automatically from my suppliers' wholesale sites. This lets me keep up-to-date information on all the products available to me from several suppliers, and lets me flag non-economical margins due to price changes.
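The margin check described above might be sketched like this; product names, prices, and the 15% threshold are all made up for illustration.

```python
# Scraped wholesale costs vs. current retail prices (invented figures).
wholesale = {"widget": 8.50, "gadget": 4.00, "doodad": 9.75}
retail = {"widget": 12.99, "gadget": 4.20, "doodad": 9.50}

MIN_MARGIN = 0.15  # flag anything under a 15% margin

def flag_uneconomical(wholesale, retail, min_margin=MIN_MARGIN):
    """Return products whose margin has fallen below the threshold."""
    flagged = []
    for product, cost in wholesale.items():
        price = retail[product]
        margin = (price - cost) / price
        if margin < min_margin:
            flagged.append(product)
    return flagged

print(flag_uneconomical(wholesale, retail))
```

Run daily against freshly scraped prices, a check like this surfaces exactly the price changes the answer mentions.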