How can I prevent my ASP.NET website from being screen scraped?

Posted on 2024-08-30 03:23:16


Comments (8)

鸵鸟症 2024-09-06 03:23:16

It is possible to try to detect screen scrapers:

Use cookies and timing; this will make things harder for out-of-the-box screen scrapers. Also check for JavaScript support, which most scrapers lack, and inspect the browser metadata (user-agent and related headers) to verify the client really is a web browser.

You can also count requests per minute: a user driving a browser can only make a small number of requests per minute, so server-side logic that detects too many requests per minute can presume screen scraping is taking place and block the offending IP address for some period of time. If this starts to affect legitimate crawlers, log the IPs that get blocked and allowlist them as needed.
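The per-minute limit described above can be sketched as a small sliding-window counter. This is an illustrative sketch in Python; a real ASP.NET site would implement the same logic in an HttpModule or middleware, and the thresholds, block duration, and names here are assumptions, not values from the answer.

```python
import time
from collections import defaultdict, deque

# Illustrative thresholds -- tune for your own traffic.
WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 30
BLOCK_SECONDS = 600

_request_log = defaultdict(deque)  # ip -> timestamps of recent requests
_blocked_until = {}                # ip -> time at which the block expires
ALLOWLIST = set()                  # known-good crawler IPs, added by hand

def should_block(ip, now=None):
    """Return True if this request should be rejected as likely scraping."""
    now = time.time() if now is None else now
    if ip in ALLOWLIST:
        return False
    if _blocked_until.get(ip, 0) > now:
        return True
    window = _request_log[ip]
    window.append(now)
    # Drop timestamps that have fallen out of the sliding window.
    while window and window[0] <= now - WINDOW_SECONDS:
        window.popleft()
    if len(window) > MAX_REQUESTS_PER_WINDOW:
        _blocked_until[ip] = now + BLOCK_SECONDS
        return True
    return False
```

Logging the blocked IPs (rather than silently dropping them) is what makes the "allowlist legitimate crawlers as needed" step from the answer possible.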

You can also use http://www.copyscape.com/ to protect your content; at the very least it will tell you who is reusing your data.

See this question also:

Protection from screen scraping

Also take a look at

http://blockscraping.com/

Nice doc about screen scraping:

http://www.realtor.org/wps/wcm/connect/5f81390048be35a9b1bbff0c8bc1f2ed/scraping_sum_jun_04.pdf?MOD=AJPERES&CACHEID=5f81390048be35a9b1bbff0c8bc1f2ed

How to prevent screen scraping:

http://mvark.blogspot.com/2007/02/how-to-prevent-screen-scraping.html

寒尘 2024-09-06 03:23:16

Unplug the network cable to the server.

To paraphrase: if the public can see it, it can be scraped.

Update: on second look it appears I am not answering the question. Sorry. Vecdid has offered a good answer.

But any half-decent coder could defeat the measures listed. In that context, my answer could be considered valid.

临风闻羌笛 2024-09-06 03:23:16

I don't think it is possible without authenticating users to your site.

指尖微凉心微凉 2024-09-06 03:23:16

You could use a CAPTCHA.

Alternatively, you can mitigate scraping by throttling connections. It won't completely prevent screen scraping, but it will probably keep scrapers from getting enough data to be useful.

First, for cookied users, throttle connections so that at most one page view per second is served; once the one-second timer expires, no throttling applies at all. This has no impact on normal users but a large impact on screen scrapers (at least if you have a lot of pages they're targeting).

Next, require cookies to view the data-sensitive pages.

Scrapers will still be able to get in, but as long as you don't accept bogus cookies, they won't be able to scrape much at any real speed.
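The "at most one page view per second per cookied session" idea above can be sketched as a tiny per-session timer. This is a Python sketch under assumed names; in ASP.NET the session ID would come from the auth or session cookie, and the one-second interval is the figure suggested in the answer.

```python
import time

MIN_INTERVAL = 1.0   # seconds between served page views per session
_last_seen = {}      # session_id -> time of the last served page view

def allow_request(session_id, now=None):
    """Serve the page only if this session's last view was >= 1s ago."""
    now = time.time() if now is None else now
    last = _last_seen.get(session_id)
    if last is not None and now - last < MIN_INTERVAL:
        return False  # too soon after the previous page view: throttle
    _last_seen[session_id] = now
    return True
```

Because the key is the session cookie rather than the IP address, this only bites clients that actually present a valid cookie, which is why the answer pairs it with rejecting bogus cookies.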

东北女汉子 2024-09-06 03:23:16

Ultimately you can't stop this.

You can make it harder, for example by setting up a robots.txt file, but you have to get the information onto legitimate users' screens, so it has to be served somehow; and if it is served, your competitors can get to it.

If you force users to log in you can stop most robots, but there is nothing to stop a competitor from simply registering for your site. This may also drive potential customers away if they can't access some information for "free".
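The robots.txt mentioned above only deters well-behaved crawlers; a scraper is free to ignore it. A minimal example (the paths here are hypothetical, not from the question):

```text
# robots.txt -- honored only by cooperative crawlers; scrapers can ignore it.
User-agent: *
Disallow: /listings/    # hypothetical data-heavy section
Crawl-delay: 10
```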

暖风昔人 2024-09-06 03:23:16

If your competitor is in the same country as you, post an acceptable use policy and terms of service clearly on your site. State that you do not allow robots, screen scraping, or the like. If the scraping continues, have an attorney send them a friendly cease-and-desist letter.

流年已逝 2024-09-06 03:23:16

I don't think that's possible. And whatever you come up with, it will be as bad for search engine optimization as it is for the competition. Is that really desirable?

忘你却要生生世世 2024-09-06 03:23:16

How about serving every bit of text as an image? Once that is done, either your competitors will be forced to invest in OCR technology, or you will find that you have no users, so the question will be moot.
