Blocking web scrapers

Posted on 2024-09-12 23:08:31


What are ways that websites can block web scrapers? How can you identify if your server is being accessed by a bot?

Comments (6)

满天都是小星星 2024-09-19 23:08:31

  • Captchas
  • Form submitted in less than a second
  • Hidden (by CSS) field that gets a value filled in during form submission (a honeypot)
  • Frequent page visits

Simple bots cannot scrape text from Flash, images, or sound.
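The honeypot and timing checks above can be sketched in a small server-side helper. This is a minimal sketch, not a complete implementation: the hidden field name `website_url`, the one-second threshold, and the function name are all illustrative assumptions.

```python
import time

# Hypothetical form-validation helper illustrating two of the checks above:
# a CSS-hidden "honeypot" field that humans never fill in, and a minimum
# time-to-submit threshold. Field name and threshold are assumptions.
def looks_like_bot(form_data, render_time, submit_time, min_seconds=1.0):
    # Honeypot: the field is hidden via CSS, so a real user leaves it empty;
    # naive bots tend to auto-fill every field they find.
    if form_data.get("website_url"):
        return True
    # Timing: humans rarely complete a form in under a second.
    if submit_time - render_time < min_seconds:
        return True
    return False
```

On the page, `render_time` would be stamped into the form (e.g. a signed hidden field or the session) when it is served, and compared against the submission time on the server.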

缱绻入梦 2024-09-19 23:08:31


Unfortunately your question is similar to asking how to block spam: there's no fixed answer, and nothing will stop someone (or a bot) that is persistent.

However, here are some methods that can be implemented:

  1. Check the User-Agent (though this can be spoofed).
  2. Use robots.txt (well-behaved bots will, hopefully, respect it).
  3. Detect IP addresses that access a lot of pages too consistently (every "x" seconds).
  4. Manually, or with flags in your system, check who is visiting your site and block the routes the scrapers take.
  5. Don't use a standard template on your site; use generic CSS class names, and don't put HTML comments in your code.
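Point 3 (flagging IPs that request pages too quickly) can be sketched as a sliding-window counter per IP. The class name and the thresholds here are illustrative assumptions, not tuned values:

```python
import time
from collections import defaultdict, deque

# Minimal sliding-window sketch: flag an IP once it exceeds max_hits
# requests within window_seconds. Thresholds are assumptions.
class RateWatcher:
    def __init__(self, max_hits=20, window_seconds=10.0):
        self.max_hits = max_hits
        self.window = window_seconds
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def record(self, ip, now=None):
        """Record one request; return True if the IP exceeds the limit."""
        now = time.time() if now is None else now
        q = self.hits[ip]
        q.append(now)
        # Drop timestamps that have fallen out of the window.
        while q and now - q[0] > self.window:
            q.popleft()
        return len(q) > self.max_hits
```

In practice this would run in middleware, keyed on the client IP, with flagged addresses challenged (e.g. with a captcha) rather than hard-blocked, since NATs put many users behind one IP.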
南街女流氓 2024-09-19 23:08:31


You can use robots.txt to block bots that take notice of it (while still letting through known instances such as Google, etc.) - but that won't stop those that ignore it. You may be able to get the user agent from your web server logs, or you could update your code to record it somewhere. If you then wanted to block particular user agents from accessing your website, you could just return an empty/default screen and/or a particular server status code.
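The user-agent blocking described here can be sketched as a small dispatch function. The blocklist substrings and the function name are illustrative assumptions; in a real site this logic would sit in web-server config or middleware:

```python
# Sketch: compare the request's User-Agent against a blocklist harvested
# from your logs, and answer blocked agents with an empty body and an
# explicit status code. The list contents are examples only.
BLOCKED_AGENT_SUBSTRINGS = ["badbot", "scrapy", "curl"]

def response_for(user_agent):
    """Return (status_code, body) for a request with this User-Agent."""
    ua = (user_agent or "").lower()
    if any(token in ua for token in BLOCKED_AGENT_SUBSTRINGS):
        return 403, ""          # empty/default screen, particular server code
    return 200, "<html>...</html>"
```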

暗恋未遂 2024-09-19 23:08:31


I don't think there is a way of doing exactly what you need, because a crawler/scraper can edit all the headers it sends when requesting a page, such as User-Agent, so you won't be able to tell whether a request comes from a user on Mozilla Firefox or from a crawler/scraper...
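To illustrate why header checks are weak: with the standard library alone, a client can set its User-Agent to a browser string. This builds (but does not send) such a request; the URL is a placeholder:

```python
from urllib.request import Request

# A scraper masquerading as a desktop browser simply sets the header.
# Server-side, these headers are indistinguishable from a real browser's.
req = Request(
    "http://example.com/page",
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
)
```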

涙—继续流 2024-09-19 23:08:31


Scrapers rely to some extent on the consistency of markup from page load to page load. If you want to make life difficult for them, come up with a means of serving altered markup from request to request.
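One way to serve altered markup per request is to rewrite known CSS class names to fresh per-response aliases, so selectors a scraper has memorized stop matching on the next load. This is a minimal sketch; the function name, alias scheme, and class names are assumptions, and a real implementation would also rewrite the served CSS using the returned mapping:

```python
import secrets

# Sketch: replace each listed class name with a random per-response alias.
# The returned mapping lets the server emit matching CSS for this response.
def randomize_classes(html, class_names):
    mapping = {}
    for name in class_names:
        alias = "c" + secrets.token_hex(4)  # fresh alias each response
        mapping[name] = alias
        html = html.replace(f'class="{name}"', f'class="{alias}"')
    return html, mapping
```

The trade-off is that per-request markup defeats caching and complicates your own CSS and JavaScript, so this is usually reserved for the data scrapers actually target (prices, contact details, etc.).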

寂寞陪衬 2024-09-19 23:08:31


Something like "Bad Behavior" might help: http://www.bad-behavior.ioerror.us/

From their site:

Bad Behavior is designed to integrate into your PHP-based Web site, running as early as possible to throw out spam bots before they have the opportunity to vandalize your site with their junk, or even to scrape your pages for e-mail addresses and forms to fill out.

Not only does Bad Behavior block actual vandalism to your site, it also blocks many e-mail address harvesters, resulting in less e-mail spam, and many automated Web site cracking tools, helping to improve your Web site’s security.
