How best to screen-scrape a password-protected site on behalf of a third party?
I want to write a program that analyzes your fantasy baseball team and notifies you of recommended actions, possibly multiple times per day. The problem is that you aren't playing fantasy baseball on my site; you're playing on Yahoo, CBS, ESPN, etc.
On the majority of these sites, fantasy teams and leagues are not public, so you must be logged in and a member of the league to see the teams in the league.
All I need is for the plain HTML of the team page on each of those sites to be sent to my server, where I can then parse and analyze it and send user notifications.
The problem is that I need username/password combinations to easily get this data to my server when I need it, and I think a lot of people wouldn't want to entrust their Yahoo/ESPN/CBS password to me.
I have come up with several possible ways to solve this problem:
1. The most obvious way is to ask for their credentials for the site on which their team is hosted. Then I could just programmatically log in and request the data I need. I'm guessing some people would be comfortable giving me their credentials and others not so much.
2. Write a desktop client, which the user then downloads. The client would require their credentials, but it could then do exactly what the server-based version would do: log in, request the page, and send the page back to my server. The difference is that their password would never need to leave their desktop. Their computer would need to be on, with this program running, for this method to work.
3. Write browser add-ons that navigate to the page I need, use the cookie saved from a previous login to get into the site, and send the page back to my server. This doesn't require my software to ever ask for their password, but if the cookie expires I am hosed, and besides, I don't know much about browser add-ons.
I'm sure there are other options, but these are what I've come up with so far.
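To make option 1 concrete, here is a minimal sketch of a programmatic login using only the Python standard library. It is an illustration under assumptions, not how any of these sites actually work: the URLs, the form field names `username` and `password`, and the helper names are all hypothetical, and a real login form will usually have its own field names and hidden tokens. The desktop client in option 2 would run essentially the same steps on the user's machine.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Return an opener that remembers cookies between requests, plus its jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(login_url, username, password):
    """Build (but do not send) a form-encoded POST for a login form.

    The field names "username" and "password" are hypothetical; a real
    site's form will use its own names and often hidden tokens as well.
    """
    form = urllib.parse.urlencode({"username": username, "password": password})
    return urllib.request.Request(login_url, data=form.encode("utf-8"), method="POST")


def fetch_team_page(opener, login_request, team_url):
    """Log in, then fetch the team page with the session cookies attached."""
    with opener.open(login_request, timeout=30):
        pass  # the response body is not needed, only the cookies it sets
    with opener.open(team_url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")
```

The cookie jar is what carries the logged-in session from the first request to the second, which is also why this approach depends on holding the user's password wherever the code runs.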
I have two questions:
1. What are the other possibilities for this type of task?
2. Am I overestimating people's reluctance to give me their Yahoo (for example) password? Is option (1) above the obvious choice?
It was suggested in the comments that I try Yahoo Pipes, and that looked like a promising suggestion, so I explored it a bit. Having looked at it now, I don't think it is an option. So it looks like I'll be going with option 1.
Comments (3)
This is a problem I grappled with a couple of years ago when I wanted to do the same thing. Our site is http://benchcoach.com and the options we were considering were the following:
Originally we considered getting the user's credentials. We would then log in and scrape their league and team info. The problem is that, after reading several of the various terms of service, it was clear this would definitely violate them. On top of this, Yahoo! was definitely one of the sites we were considering, and its users have email (where we could get access to sensitive data) and Yahoo! Wallet. In addition, it would be pretty trivial for Yahoo/ESPN/CBS to block our programmatic logins by IP address.
The solution we settled on (not 100% happy with it, but it does seem to work) was asking our users to install a bookmarklet (like the ones for Delicious, Digg or Reddit) which would post the current HTML page to our servers, where we could parse the data and load our database. If they were still logged into their Yahoo/ESPN/CBS account, we would send them straight to the pages; otherwise, those sites would prompt for authentication. Clicking the bookmarklet once more would post the page to our servers.
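As a rough sketch of the receiving end of such a bookmarklet, the Python below accepts a form-encoded POST and pulls the page title out of the submitted HTML. This is not the code described above, only an illustration: the field names `url` and `html`, the class and function names, and the idea of replying with the page title are all assumptions, and real roster parsing would be specific to each site's markup.

```python
import urllib.parse
from html.parser import HTMLParser
from http.server import BaseHTTPRequestHandler, HTTPServer


class TitleParser(HTMLParser):
    """Collect the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def parse_posted_page(body):
    """Decode a form-encoded POST body with hypothetical fields 'url' and 'html'."""
    fields = urllib.parse.parse_qs(body.decode("utf-8"))
    html = fields.get("html", [""])[0]
    parser = TitleParser()
    parser.feed(html)
    return {"url": fields.get("url", [""])[0], "title": parser.title.strip(), "html": html}


class SyncHandler(BaseHTTPRequestHandler):
    """Receive one posted page per request and acknowledge it."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        page = parse_posted_page(self.rfile.read(length))
        # A real service would parse the roster out of page["html"] here.
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.end_headers()
        self.wfile.write(("received: " + page["title"]).encode("utf-8"))


# To try it locally (blocks until interrupted):
# HTTPServer(("127.0.0.1", 8000), SyncHandler).serve_forever()
```

Because the user's own browser submits the page, the server never sees a password and never connects to the fantasy site itself.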
The pros of this approach were that we never collected anyone's credentials, so any security concerns were alleviated. Secondly, it made it impossible for Yahoo/ESPN/CBS to block access to our service, since we would never connect directly to their servers; instead, the user's browser would post the contents of the page to our server.
The problem with this is that it takes 2 clicks to post a page to our site. For head-to-head leagues we needed 3-4 pages, so it would take our user 6-8 clicks to sync their league to our servers. We're still looking at options for this.
One important note: I ran into the product manager of the Yahoo Fantasy Football site at a conference a year ago. We talked about how we were getting the Yahoo data, and he confirmed that collecting credentials would violate their TOS and that they might stop us. While I don't think they would have, it would have been hard to justify investing time and energy in developing this only to have them block our site and piss off users by closing their accounts.
A potentially more complicated answer might be possible with (for example) Yahoo Pipes.
Hypothetically, you create a pipe which prompts the user for their credentials and provides them with a URL which contains their scraped data. They enter this URL on your site, and never have to give you their credentials directly. Even better, the security-conscious could examine what the pipe was actually doing before entering any information.
The downside would be increased complexity (and you'd have to write and maintain the pipe). Having said that, you could provide a link directly to the published pipe from your site, to make things as easy as possible.
Option 1 is the obvious choice. People who trust your site will provide the details. There is no other way to log in to another site while screen scraping.