A robots.txt file is an implied license, especially since you are aware of it. Continuing to scrape their site could therefore be seen as unauthorized access (i.e., hacking). That sucks, but arguments like this have been made in other recent legal cases (not directly about robots.txt, but about other "passive controls").
Grabbing prices violates no copyright law, including the DMCA, since copyright does not cover factual information, only creative works.
Ethically, you should not grab prices because the vendor should be able to change prices without worrying about being accused of a bait-and-switch by people coming from your site.
Have you taken the high road, explaining the site to them and saying you'd love to include them in your list of vendors? Maybe they will love the idea and actually expose the data in a way that is easy for you to consume and less resource-intensive for them to produce.
There are no laws written directly about robots.txt because netiquette is generally followed. Don't be one of the "bad guys."
Some people filter robots because they use URL links to perform "actions" like adding things to carts, and robots leave them with massive numbers of abandoned shopping carts in their database.
Some people filter robots because they have exclusive prices that they can't advertise openly based on agreements with their vendors. You could be putting them in a bad position by exposing those prices on your site.
In this economy, if a company doesn't want to do everything possible to advertise themselves, it's their own fault that you don't include them.
The other use of robots.txt is to help protect web spiders from themselves. It's relatively easy for a web spider to get mired in an infinitely deep forest of links, and a properly constructed robots.txt file will tell the spider that "you don't need to go here".
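To make that concrete, here is a minimal sketch of how a spider could check robots.txt before following a link, using Python's standard `urllib.robotparser`. The robots.txt contents, the `shop.example.com` URLs, and the `price-compare-bot` user agent are all made up for illustration; a real crawler would fetch the site's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary shop; the paths are invented.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cart/
Disallow: /calendar/
"""


def build_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser, url, user_agent="price-compare-bot"):
    """Return True if robots.txt permits this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# A normal product page is not covered by any Disallow rule.
print(allowed(parser, "https://shop.example.com/products/widget"))

# An "action" link under /cart/ is off limits, so the spider skips it.
print(allowed(parser, "https://shop.example.com/cart/add?item=42"))
```

Checking every URL this way before requesting it keeps the spider out of the areas the site owner has flagged, whether that is a link trap or a cart endpoint.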
Many people have tried to build businesses off building "price comparison" engines that scraped major sites.
Once you start getting any sort of traffic/revenue to speak of, you will receive a cease and desist. It's happened to dozens, if not hundreds of projects. I even worked on a small project that received a C&D from Craigslist.
You know how they say "It's easier to ask forgiveness than it is to get permission"? It doesn't hold true with page scraping. Get permission, or you will be hearing from their lawyers.
If you're lucky, it'll be early on, when you've got nothing to lose. If it's late, you may lose your business and all your work overnight, with a single letter.
Getting permission shouldn't be hard. Unless you're doing something sneaky, you're likely going to drive them additional traffic. Hell, once your product takes off, sites may be begging you, or even paying you to add their data.
One reason we allow robots to dig through the web without complaint is that we have a way to stop them if we want to. Protects both sides.
Remember the uproar when Cuil's robots were accused of going over-the-top, apparently acting like a DoS attack in some cases and using up bandwidth allowances of some small sites?
If too many people violate robots.txt we might get something worse.
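For what it's worth, avoiding the "accidental DoS" problem is not hard. Below is a rough sketch, not a definitive implementation, of a throttle that spaces out requests and honors a `Crawl-delay` line if the site provides one. The robots.txt contents, the 1-second fallback, and the names `delay_for` and `Throttle` are my own assumptions; `Crawl-delay` is a non-standard directive that not every site uses.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the Crawl-delay value is invented for illustration.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
"""

DEFAULT_DELAY = 1.0  # assumed fallback when the site states no Crawl-delay


def delay_for(robots_txt, user_agent="price-compare-bot"):
    """Return the site's Crawl-delay in seconds, or DEFAULT_DELAY if none is given."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else DEFAULT_DELAY


class Throttle:
    """Spaces out requests so consecutive calls are at least `delay` seconds apart."""

    def __init__(self, delay, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until it is polite to send the next request."""
        if self._last is not None:
            remaining = self.delay - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


throttle = Throttle(delay_for(ROBOTS_TXT))
```

The idea is to call `throttle.wait()` right before each request to the same host. The `clock` and `sleep` parameters exist only so the timing can be tested without actually sleeping.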
To answer the narrow question, for the price comparison website you're probably best off grabbing the price in real time, rather than scraping the database in advance. Hard to imagine that being a problem.
I'm showing some ignorance here, but I always thought a bot was something only sent out by a search engine. Like Google or Yahoo.
Thus, if you wrote an application that searched content on the internet, I wouldn't consider that a search engine bot, which to my knowledge is what robots.txt is trying to block.
But this may just be selective ignorance, because I might do it until the webmaster of that site contacted me and asked me to stop :)
If people make it available to public access, they shouldn't try to put limits on it. Adding a robots.txt file to your site is the equivalent to putting a sign on your lawn that says "Please don't look at me."
"No" means "no".
Short answer: No.
On the narrow issue: If a seller says that their prices are secret, I think you have to respect that. I'd contact them and ask if they really don't want price comparison engines like yours to include them, or if the "no trespassing" sign is for technical reasons. If the latter, perhaps they'll provide you with an alternative. If the former, then I'd say too bad, they don't get included, they lose some business, and it's their problem.
Tangential rant: Personally, I get pretty annoyed with companies that make me jump through hoops to find out the price of their products, places that make me call and talk to a salesman so he can give me a hard-sell pitch, or worse, make me give them my phone number so their salesman can call and harass me. I figure that if they're afraid to tell me the price, it probably means that it's too high.
In general: A robots.txt file is like a "No Trespassing" sign. It's the owner's right to say who is allowed on their property. If you think their reasons are dumb, you can politely suggest they take the sign down. But you don't have the right to disregard their wishes. If someone puts a No Trespassing sign on his yard, and I say, "Hey, I just want to take a quick short cut, what's the big deal?" -- Maybe I'm stepping on his prized Bulgarian violet bulbs and destroying a valuable investment. Maybe I'm crossing his people's sacred burial ground and offending their religious sensibilities. Or maybe he's just an ornery jerk. But it's still his property and his right. Oh, and if I fall into the dangerous sinkhole after ignoring the No Trespassing sign, who's to blame? (In America, I could probably still sue him for all he's worth despite the fact that he warned me, but is that right?)
An interesting IRL version of this story, involving the Harvard Coop:
Coop Calls Cops On ISBN Copiers.