Note: Since the complete version of this answer exceeds Stack Overflow's length limit, you'll need to head to GitHub to read the extended version, with more tips and details.
In order to hinder scraping (also known as Webscraping, Screenscraping, Web data mining, Web harvesting, or Web data extraction), it helps to know how these scrapers work, and, by extension, what prevents them from working well.
There are various types of scrapers, and each works differently:
Spiders, such as Google's bot or website copiers like HTTrack, which recursively follow links to other pages in order to get data. These are sometimes used for targeted scraping to get specific data, often in combination with an HTML parser to extract the desired data from each page.
Shell scripts: Sometimes, common Unix tools are used for scraping: Wget or Curl to download pages, and Grep (Regex) to extract the data.
HTML parsers, such as ones based on Jsoup, Scrapy, and others. Similar to shell-script regex based ones, these work by extracting data from pages based on patterns in HTML, usually ignoring everything else.
For example: If your website has a search feature, such a scraper might submit a request for a search, and then extract all the result links and their titles from the results page HTML, specifically to get only search result links and their titles. These are the most common.
Screenscrapers, based on eg. Selenium or PhantomJS, which open your website in a real browser, run JavaScript, AJAX, and so on, and then get the desired text from the webpage, usually by:
Getting the HTML from the browser after your page has been loaded and JavaScript has run, and then using an HTML parser to extract the desired data. These are the most common, and so many of the methods for breaking HTML parsers / scrapers also work here.
Taking a screenshot of the rendered pages, and then using OCR to extract the desired text from the screenshot. These are rare, and only dedicated scrapers who really want your data will set this up.
Webscraping services such as ScrapingHub or Kimono. In fact, there are people whose job it is to figure out how to scrape your site and pull out the content for others to use.
Unsurprisingly, professional scraping services are the hardest to deter, but if you make it hard and time-consuming to figure out how to scrape your site, these (and the people who pay them to do so) may not bother to scrape your website.
Embedding your website in other site's pages with frames, and embedding your site in mobile apps.
While not technically scraping, mobile apps (Android and iOS) can embed websites, and inject custom CSS and JavaScript, thus completely changing the appearance of your pages.
Human copy-paste: People will copy and paste your content in order to use it elsewhere.
There is a lot of overlap between these different kinds of scrapers, and many scrapers will behave similarly, even if they use different technologies and methods.
These tips are mostly my own ideas, various difficulties that I've encountered while writing scrapers, as well as bits of information and ideas from around the interwebs.
How to stop scraping
You can't completely prevent it, since whatever you do, determined scrapers can still figure out how to scrape. However, you can stop a lot of scraping by doing a few things:
Monitor your logs & traffic patterns; limit access if you see unusual activity:
Check your logs regularly, and in case of unusual activity indicative of automated access (scrapers), such as many similar actions from the same IP address, you can block or limit access.
Specifically, some ideas:
Rate limiting:
Only allow users (and scrapers) to perform a limited number of actions in a certain time - for example, only allow a few searches per second from any specific IP address or user. This will slow down scrapers, and make them ineffective. You could also show a captcha if actions are completed too fast or faster than a real user would.
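As a rough illustration, here is a minimal server-side rate-limiting sketch in PHP. It assumes the APCu extension is available (a database or Redis counter would work just as well), and the thresholds are made-up examples that you would tune to what a real user could plausibly do:
<?php
// Minimal per-IP rate limiting sketch (assumes the APCu extension is enabled).
// Allow at most $maxRequests requests per $windowSeconds from one IP, then
// answer with HTTP 429 (or redirect to a captcha page instead).
$ip            = $_SERVER['REMOTE_ADDR'];
$windowSeconds = 10;
$maxRequests   = 5;
$key           = 'rate:' . $ip;

$count = apcu_fetch($key);
if ($count === false) {
    apcu_store($key, 1, $windowSeconds);   // first request in this window
} elseif ($count < $maxRequests) {
    apcu_inc($key);                        // still within the allowance
} else {
    http_response_code(429);
    exit('Sorry, something went wrong. Please try again later.');
}
// ... normal page rendering continues below ...
?>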
Detect unusual activity:
If you see unusual activity, such as many similar requests from a specific IP address, someone looking at an excessive number of pages or performing an unusual number of searches, you can prevent access, or show a captcha for subsequent requests.
Don't just monitor & rate limit by IP address - use other indicators too:
If you do block or rate limit, don't just do it on a per-IP address basis; you can use other indicators and methods to identify specific users or scrapers. Some indicators which can help you identify specific users / scrapers include:
How fast users fill out forms, and where on a button they click;
You can gather a lot of information with JavaScript, such as screen size / resolution, timezone, installed fonts, etc; you can use this to identify users.
HTTP headers and their order, especially User-Agent.
As an example, if you get many requests from a single IP address, all using the same User Agent and screen size (determined with JavaScript), and the user (the scraper in this case) always clicks on the button in the same way and at regular intervals, it's probably a screen scraper; and you can temporarily block similar requests (eg. block all requests with that user agent and screen size coming from that particular IP address), and this way you won't inconvenience real users on that IP address, eg. in case of a shared internet connection.
You can also take this further, as you can identify similar requests even if they come from different IP addresses, which is indicative of distributed scraping (a scraper using a botnet or a network of proxies). If you get a lot of otherwise identical requests, but they come from different IP addresses, you can block them. Again, be careful not to inadvertently block real users.
This can be effective against screenscrapers which run JavaScript, as you can get a lot of information from them. There are related questions on Security Stack Exchange about how to uniquely identify users with the same external IP address (for more details), and about why people use IP address bans when IP addresses often change (for the limits of these methods).
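For illustration, here is a hedged sketch of building such a request "signature" from more than just the IP; the screen-size cookie is an assumption (it would be set by your own JavaScript), and the APCu counter is just an example storage choice:
<?php
// Sketch: combine several request properties into one fingerprint and count
// requests per fingerprint. 'screen' is assumed to come from a cookie that
// your own JavaScript sets (e.g. screen.width + "x" + screen.height).
$parts = [
    $_SERVER['HTTP_USER_AGENT']      ?? '',
    $_SERVER['HTTP_ACCEPT_LANGUAGE'] ?? '',
    $_SERVER['HTTP_ACCEPT_ENCODING'] ?? '',
    $_COOKIE['screen']               ?? '',
];
$fingerprint = sha1(implode('|', $parts));

// A single fingerprint producing a large volume of requests - even across many
// different IP addresses - is a hint of (distributed) scraping.
$key = 'fp:' . $fingerprint;
if (apcu_fetch($key) === false) {
    apcu_store($key, 1, 60);       // count per 60-second window
} else {
    apcu_inc($key);
}
?>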
Instead of temporarily blocking access, use a Captcha:
The simple way to implement rate-limiting would be to temporarily block access for a certain amount of time; however, using a Captcha may be better - see the section on Captchas further down.
Require registration & login
Require account creation in order to view your content, if this is feasible for your site. This is a good deterrent for scrapers, but is also a good deterrent for real users.
If you require account creation and login, you can accurately track user and scraper actions. This way, you can easily detect when a specific account is being used for scraping, and ban it. Things like rate limiting or detecting abuse (such as a huge number of searches in a short time) become easier, as you can identify specific scrapers instead of just IP addresses.
In order to avoid scripts creating many accounts, you should:
Require an email address for registration, and verify that email address by sending a link that must be opened in order to activate the account. Allow only one account per email address.
Require a captcha to be solved during registration / account creation.
Requiring account creation to view content will drive users and search engines away; if you require account creation in order to view an article, users will go elsewhere.
Block access from cloud hosting and scraping service IP addresses
Sometimes, scrapers will be run from web hosting services, such as Amazon Web Services or GAE, or VPSes. Limit access to your website (or show a captcha) for requests originating from the IP addresses used by such cloud hosting services.
Similarly, you can also limit access from IP addresses used by proxy or VPN providers, as scrapers may use such proxy servers to avoid many requests being detected.
Beware that by blocking access from proxy servers and VPNs, you will negatively affect real users.
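As a sketch of how such a check might look, here is a small PHP snippet; the CIDR ranges shown are placeholders (TEST-NET addresses), and in practice you would load the cloud providers' published IP range lists and refresh them regularly:
<?php
// Sketch: treat traffic from cloud / hosting IP ranges with more suspicion.
function ipInCidr(string $ip, string $cidr): bool {
    [$subnet, $bits] = explode('/', $cidr);
    $mask = -1 << (32 - (int)$bits);
    return (ip2long($ip) & $mask) === (ip2long($subnet) & $mask);
}

$datacenterRanges = ['203.0.113.0/24', '198.51.100.0/24'];   // placeholder ranges only
$ip = $_SERVER['REMOTE_ADDR'];

foreach ($datacenterRanges as $range) {
    if (ipInCidr($ip, $range)) {
        // Don't hard-block: send the visitor to a captcha instead, so real users
        // behind a VPN or corporate proxy can still get through.
        header('Location: /captcha-challenge');    // hypothetical captcha page
        exit;
    }
}
?>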
Make your error message nondescript if you do block
If you do block / limit access, you should ensure that you don't tell the scraper what caused the block, thereby giving them clues as to how to fix their scraper. So a bad idea would be to show error pages with text like:
Too many requests from your IP address, please try again later.
Error, User Agent header not present !
Instead, show a friendly error message that doesn't tell the scraper what caused it. Something like this is much better:
Sorry, something went wrong. You can contact support via [email protected], should the problem persist.
This is also a lot more user friendly for real users, should they ever see such an error page. You should also consider showing a captcha for subsequent requests instead of a hard block, in case a real user sees the error message, so that you don't block and thus cause legitimate users to contact you.
Use Captchas if you suspect that your website is being accessed by a scraper.
Captchas ("Completely Automated Public Turing test to tell Computers and Humans Apart") are very effective at stopping scrapers. Unfortunately, they are also very effective at irritating users.
As such, they are useful when you suspect a possible scraper, and want to stop the scraping, without also blocking access in case it isn't a scraper but a real user. You might want to consider showing a captcha before allowing access to the content if you suspect a scraper.
Things to be aware of when using Captchas:
Don't roll your own, use something like Google's reCaptcha: it's a lot easier than implementing a captcha yourself, it's more user-friendly than some blurry and warped text solution you might come up with yourself (users often only need to tick a box), and it's also a lot harder for a scripter to solve than a simple image served from your site. (A server-side verification sketch follows this list.)
Don't include the solution to the captcha in the HTML markup: I've actually seen one website which had the solution for the captcha in the page itself, (although quite well hidden) thus making it pretty useless. Don't do something like this. Again, use a service like reCaptcha, and you won't have this kind of problem (if you use it properly).
Captchas can be solved in bulk: There are captcha-solving services where actual, low-paid, humans solve captchas in bulk. Again, using reCaptcha is a good idea here, as they have protections (such as the relatively short time the user has in order to solve the captcha). This kind of service is unlikely to be used unless your data is really valuable.
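For reference, here is a hedged sketch of verifying a reCAPTCHA (v2 checkbox style) response on the server; the secret key is a placeholder, and the surrounding form handling is assumed to exist:
<?php
// Sketch: server-side verification of a submitted reCAPTCHA response.
$secret   = 'YOUR_SECRET_KEY';                        // placeholder
$response = $_POST['g-recaptcha-response'] ?? '';

$result = file_get_contents(
    'https://www.google.com/recaptcha/api/siteverify',
    false,
    stream_context_create([
        'http' => [
            'method'  => 'POST',
            'header'  => 'Content-Type: application/x-www-form-urlencoded',
            'content' => http_build_query([
                'secret'   => $secret,
                'response' => $response,
                'remoteip' => $_SERVER['REMOTE_ADDR'],
            ]),
        ],
    ])
);

$data = json_decode($result, true);
if (empty($data['success'])) {
    http_response_code(403);
    exit('Captcha verification failed.');
}
// ... continue serving the protected content ...
?>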
Serve your text content as an image
You can render text into an image server-side, and serve that to be displayed, which will hinder simple scrapers extracting text.
However, this is bad for screen readers, search engines, performance, and pretty much everything else. It's also illegal in some places (due to accessibility, eg. the Americans with Disabilities Act), and it's also easy to circumvent with some OCR, so don't do it.
You can do something similar with CSS sprites, but that suffers from the same problems.
Don't expose your complete dataset:
If feasible, don't provide a way for a script / bot to get all of your dataset. As an example: You have a news site, with lots of individual articles. You could make those articles be only accessible by searching for them via the on site search, and, if you don't have a list of all the articles on the site and their URLs anywhere, those articles will be only accessible by using the search feature. This means that a script wanting to get all the articles off your site will have to do searches for all possible phrases which may appear in your articles in order to find them all, which will be time-consuming, horribly inefficient, and will hopefully make the scraper give up.
This will be ineffective if:
The bot / script does not want / need the full dataset anyway.
Your articles are served from a URL which looks something like example.com/article.php?articleId=12345. This (and similar things) will allow scrapers to simply iterate over all the articleIds and request all the articles that way.
There are other ways to eventually find all the articles, such as by writing a script to follow links within articles which lead to other articles.
Searching for something like "and" or "the" can reveal almost everything, so that is something to be aware of. (You can avoid this by only returning the top 10 or 20 results).
You need search engines to find your content.
Don't expose your APIs, endpoints, and similar things:
Make sure you don't expose any APIs, even unintentionally. For example, if you are using AJAX or network requests from within Adobe Flash or Java Applets (God forbid!) to load your data it is trivial to look at the network requests from the page and figure out where those requests are going to, and then reverse engineer and use those endpoints in a scraper program. Make sure you obfuscate your endpoints and make them hard for others to use, as described.
To deter HTML parsers and scrapers:
Since HTML parsers work by extracting content from pages based on identifiable patterns in the HTML, we can intentionally change those patterns in order to break these scrapers, or even screw with them. Most of these tips also apply to other scrapers like spiders and screenscrapers too.
Frequently change your HTML
Scrapers which process HTML directly do so by extracting contents from specific, identifiable parts of your HTML page. For example: If all pages on your website have a div with an id of article-content, which contains the text of the article, then it is trivial to write a script to visit all the article pages on your site, and extract the content text of the article-content div on each article page, and voilà, the scraper has all the articles from your site in a format that can be reused elsewhere.
If you change the HTML and the structure of your pages frequently, such scrapers will no longer work.
You can frequently change the ids and classes of elements in your HTML, perhaps even automatically. So, if your div.article-content becomes something like div.a4c36dda13eaf0, and changes every week, the scraper will work fine initially, but will break after a week. Make sure to change the length of your ids / classes too, otherwise the scraper will use div.[any-14-characters] to find the desired div instead. Beware of other similar holes too.
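A minimal sketch of how such rotating class names could be generated (the salt and the weekly rotation period are arbitrary choices for the example; the same function must be used when generating both the markup and the CSS so real pages keep working):
<?php
// Sketch: derive class names that change every week, so scrapers keyed to a
// fixed class name break regularly.
function weeklyClassName(string $logicalName): string {
    $week = date('oW');                                  // ISO year + week number, changes weekly
    $hash = sha1($logicalName . '|' . $week . '|' . 'some-secret-salt');
    $len  = 8 + (hexdec($hash[0]) % 8);                  // vary the length too (8-15 characters)
    return 'c' . substr($hash, 0, $len);                 // prefix, since class names shouldn't start with a digit
}

// In the template:
echo '<div class="' . weeklyClassName('article-content') . '">...</div>';
?>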
If there is no way to find the desired content from the markup, the scraper will do so from the way the HTML is structured. So, if all your article pages are similar in that every div inside a div which comes after a h1 is the article content, scrapers will get the article content based on that. Again, to break this, you can add / remove extra markup to your HTML, periodically and randomly, eg. adding extra divs or spans. With modern server side HTML processing, this should not be too hard.
Things to be aware of:
It will be tedious and difficult to implement, maintain, and debug.
You will hinder caching. Especially if you change ids or classes of your HTML elements, this will require corresponding changes in your CSS and JavaScript files, which means that every time you change them, they will have to be re-downloaded by the browser. This will result in longer page load times for repeat visitors, and increased server load. If you only change it once a week, it will not be a big problem.
Clever scrapers will still be able to get your content by inferring where the actual content is, eg. by knowing that a large single block of text on the page is likely to be the actual article. This makes it possible to still find & extract the desired data from the page. Boilerpipe does exactly this.
Essentially, make sure that it is not easy for a script to find the actual, desired content for every similar page.
This is sort of similar to the previous tip. If you serve different HTML based on your user's location / country (determined by IP address), this may break scrapers which are delivered to users. For example, if someone is writing a mobile app which scrapes data from your site, it will work fine initially, but break when it's actually distributed to users, as those users may be in a different country, and thus get different HTML, which the embedded scraper was not designed to consume.
Frequently change your HTML, actively screw with the scrapers by doing so !
An example: You have a search feature on your website, located at example.com/search?query=somesearchquery, which returns the following HTML:
<div class="search-result">
<h3 class="search-result-title">Stack Overflow has become the world's most popular programming Q & A website</h3>
<p class="search-result-excerpt">The website Stack Overflow has now become the most popular programming Q & A website, with 10 million questions and many users, which...</p>
<a class="search-result-link" href="/stories/story-link">Read more</a>
</div>
(And so on, lots more identically structured divs with search results)
As you may have guessed this is easy to scrape: all a scraper needs to do is hit the search URL with a query, and extract the desired data from the returned HTML. In addition to periodically changing the HTML as described above, you could also leave the old markup with the old ids and classes in, hide it with CSS, and fill it with fake data, thereby poisoning the scraper. Here's how the search results page could be changed:
<div class="the-real-search-result">
<h3 class="the-real-search-result-title">Stack Overflow has become the world's most popular programming Q & A website</h3>
<p class="the-real-search-result-excerpt">The website Stack Overflow has now become the most popular programming Q & A website, with 10 million questions and many users, which...</p>
<a class="the-real-search-result-link" href="/stories/story-link">Read more</a>
</div>
<div class="search-result" style="display:none">
<h3 class="search-result-title">Visit Example.com now, for all the latest Stack Overflow related news !</h3>
<p class="search-result-excerpt">Example.com is so awesome, visit now !</p>
<a class="search-result-link" href="http://example.com/">Visit Now !</a>
</div>
(More real search results follow)
This will mean that scrapers written to extract data from the HTML based on classes or IDs will continue to seemingly work, but they will get fake data or even ads, data which real users will never see, as they're hidden with CSS.
Screw with the scraper: Insert fake, invisible honeypot data into your page
Adding on to the previous example, you can add invisible honeypot items to your HTML to catch scrapers. An example which could be added to the previously described search results page:
<div class="search-result" style="display:none">
<h3 class="search-result-title">This search result is here to prevent scraping</h3>
<p class="search-result-excerpt">If you're a human and see this, please ignore it. If you're a scraper, please click the link below :-)
Note that clicking the link below will block access to this site for 24 hours.</p>
<a class="search-result-link" href="/scrapertrap/scrapertrap.php">I'm a scraper !</a>
</div>
(The actual, real, search results follow.)
A scraper written to get all the search results will pick this up, just like any of the other, real search results on the page, and visit the link, looking for the desired content. A real human will never even see it in the first place (due to it being hidden with CSS), and won't visit the link. A genuine and desirable spider such as Google's will not visit the link either because you disallowed /scrapertrap/ in your robots.txt.
You can make your scrapertrap.php do something like block access for the IP address that visited it or force a captcha for all subsequent requests from that IP.
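A minimal sketch of what such a scrapertrap.php could look like (the ban-list file and the exact blocking mechanism are assumptions for the example; you could equally store the IPs in a database that your front controller checks):
<?php
// scrapertrap.php - honeypot endpoint sketch. Anything requesting this URL
// followed a link hidden from real users and disallowed in robots.txt, so
// record its IP with an expiry time; check this list on every normal request
// and block or show a captcha for those IPs.
$ip      = $_SERVER['REMOTE_ADDR'];
$banFile = __DIR__ . '/banned_ips.txt';                 // hypothetical ban list
$expires = time() + 24 * 3600;                          // ban for 24 hours

file_put_contents($banFile, $ip . ' ' . $expires . PHP_EOL, FILE_APPEND | LOCK_EX);

// Serve something bland so the scraper doesn't immediately notice it was trapped.
http_response_code(200);
echo 'Nothing to see here.';
?>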
Don't forget to disallow your honeypot (/scrapertrap/) in your robots.txt file so that search engine bots don't fall into it.
You can / should combine this with the previous tip of changing your HTML frequently.
Change this frequently too, as scrapers will eventually learn to avoid it. Change the honeypot URL and text. You may also want to consider changing the inline CSS used for hiding, and using an ID attribute and external CSS instead, as scrapers will learn to avoid anything which has a style attribute with CSS used to hide the content. Also try only enabling it sometimes, so the scraper works initially, but breaks after a while. This also applies to the previous tip.
Malicious people can prevent access for real users by sharing a link to your honeypot, or even embedding that link somewhere as an image (eg. on a forum). Change the URL frequently, and make any ban times relatively short.
Serve fake and useless data if you detect a scraper
If you detect what is obviously a scraper, you can serve up fake and useless data; this will corrupt the data the scraper gets from your website. You should also make it impossible to distinguish such fake data from real data, so that scrapers don't know that they're being screwed with.
As an example: you have a news website; if you detect a scraper, instead of blocking access, serve up fake, randomly generated articles, and this will poison the data the scraper gets. If you make your fake data indistinguishable from the real thing, you'll make it hard for scrapers to get what they want, namely the actual, real data.
Don't accept requests if the User Agent is empty / missing
Often, lazily written scrapers will not send a User Agent header with their request, whereas all browsers as well as search engine spiders will.
If you get a request where the User Agent header is not present, you can show a captcha, or simply block or limit access. (Or serve fake data as described above, or something else..)
It's trivial to spoof, but as a measure against poorly written scrapers it is worth implementing.
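The check itself is a one-liner; a sketch, using the same nondescript error message recommended earlier:
<?php
// Sketch: turn away requests that send no User-Agent header at all.
if (empty($_SERVER['HTTP_USER_AGENT'])) {
    http_response_code(403);
    exit('Sorry, something went wrong.');   // deliberately nondescript
}
?>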
Don't accept requests if the User Agent is a common scraper one; blacklist ones used by scrapers
In some cases, scrapers will use a User Agent which no real browser or search engine spider uses, such as:
"Mozilla" (Just that, nothing else. I've seen a few questions about scraping here, using that. A real browser will never use only that)
"Java 1.7.43_u43" (By default, Java's HttpUrlConnection uses something like this.)
"BIZCO EasyScraping Studio 2.0"
"wget", "curl", "libcurl",.. (Wget and cURL are sometimes used for basic scraping)
If you find that a specific User Agent string is used by scrapers on your site, and it is not used by real browsers or legitimate spiders, you can also add it to your blacklist.
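A sketch of such a blacklist check; the patterns are illustrative only - build your own list from what actually shows up in your logs, and be careful not to match real browsers or legitimate spiders:
<?php
// Sketch: reject requests whose User-Agent matches known scraper signatures.
$ua        = $_SERVER['HTTP_USER_AGENT'] ?? '';
$blacklist = ['/^Mozilla$/', '/^Java/i', '/wget/i', '/curl/i', '/libcurl/i', '/python-requests/i'];

foreach ($blacklist as $pattern) {
    if (preg_match($pattern, $ua)) {
        http_response_code(403);
        exit('Sorry, something went wrong.');
    }
}
?>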
If it doesn't request assets (CSS, images), it's not a real browser.
A real browser will (almost always) request and download assets such as images and CSS. HTML parsers and scrapers won't as they are only interested in the actual pages and their content.
You could log requests to your assets, and if you see lots of requests for only the HTML, it may be a scraper.
Beware that search engine bots, ancient mobile devices, screen readers and misconfigured devices may not request assets either.
Use and require cookies; use them to track user and scraper actions.
You can require cookies to be enabled in order to view your website. This will deter inexperienced and newbie scraper writers, however it is easy for a scraper to send cookies. If you do use and require them, you can track user and scraper actions with them, and thus implement rate-limiting, blocking, or showing captchas on a per-user instead of a per-IP basis.
For example: when the user performs a search, set a unique identifying cookie. When the results pages are viewed, verify that cookie. If the user opens all the search results (you can tell from the cookie), then it's probably a scraper.
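A rough sketch of that idea using a PHP session cookie; the threshold of 50 opened results is an arbitrary example (no human opens every single result), and /captcha-challenge is a hypothetical page:
<?php
// Sketch: tag each search via the session, then count how many distinct result
// pages get opened for it; a session that opens everything is suspect.
session_start();

if (isset($_GET['query'])) {                            // a search was performed
    $_SESSION['search_token']   = bin2hex(random_bytes(16));   // unique id for this search
    $_SESSION['results_opened'] = 0;
}

if (isset($_GET['result_id'])) {                        // a search result was opened
    $_SESSION['results_opened'] = ($_SESSION['results_opened'] ?? 0) + 1;
    if ($_SESSION['results_opened'] > 50) {
        header('Location: /captcha-challenge');
        exit;
    }
}
?>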
Using cookies may be ineffective, as scrapers can send the cookies with their requests too, and discard them as needed. You will also prevent access for real users who have cookies disabled, if your site only works with cookies.
Note that if you use JavaScript to set and retrieve the cookie, you'll block scrapers which don't run JavaScript, since they can't retrieve and send the cookie with their request.
Use JavaScript + Ajax to load your content
You could use JavaScript + AJAX to load your content after the page itself loads. This will make the content inaccessible to HTML parsers which do not run JavaScript. This is often an effective deterrent to newbie and inexperienced programmers writing scrapers.
Be aware of:
Using JavaScript to load the actual content will degrade user experience and performance
Search engines may not run JavaScript either, thus preventing them from indexing your content. This may not be a problem for search results pages, but may be for other things, such as article pages.
Obfuscate your markup, network requests from scripts, and everything else.
If you use Ajax and JavaScript to load your data, obfuscate the data which is transferred. As an example, you could encode your data on the server (with something as simple as base64 or more complex), and then decode and display it on the client, after fetching it via Ajax. This will mean that someone inspecting network traffic will not immediately see how your page works and loads data, and it will be tougher for someone to directly request data from your endpoints, as they will have to reverse-engineer your descrambling algorithm.
If you do use Ajax for loading the data, you should make it hard to use the endpoints without loading the page first, eg by requiring some session key as a parameter, which you can embed in your JavaScript or your HTML.
You can also embed your obfuscated data directly in the initial HTML page and use JavaScript to deobfuscate and display it, which would avoid the extra network requests. Doing this will make it significantly harder to extract the data using an HTML-only parser which does not run JavaScript, as the one writing the scraper will have to reverse engineer your JavaScript (which you should obfuscate too).
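To make the previous two points concrete, here is a hedged sketch of an Ajax endpoint that requires a per-session token and returns its payload base64-encoded (the token name and payload are made up; on the client you would decode with atob() and JSON.parse()):
<?php
// Sketch: Ajax endpoint guarded by a session token that the page embeds in its
// JavaScript, returning base64-encoded JSON instead of plain data.
session_start();

$given    = $_GET['token'] ?? '';
$expected = $_SESSION['ajax_token'] ?? null;
if ($expected === null || !hash_equals($expected, $given)) {
    http_response_code(403);
    exit;
}

$data = ['title' => 'Some article', 'body' => 'Article text...'];   // placeholder payload
header('Content-Type: text/plain');
echo base64_encode(json_encode($data));

// The page that makes the Ajax call would first set the token, e.g.:
//   $_SESSION['ajax_token'] = bin2hex(random_bytes(16));
// and print it into the JavaScript that performs the request.
?>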
You might want to change your obfuscation methods regularly, to break scrapers who have figured it out.
There are several disadvantages to doing something like this, though:
It will be tedious and difficult to implement, maintain, and debug.
It will be ineffective against scrapers and screenscrapers which actually run JavaScript and then extract the data. (Most simple HTML parsers don't run JavaScript though)
It will make your site nonfunctional for real users if they have JavaScript disabled.
Performance and page-load times will suffer.
Non-Technical:
Tell people not to scrape, and some will respect it
Find a lawyer
Make your data available, provide an API:
You could make your data easily available and require attribution and a link back to your site. Perhaps charge $$$ for it.
Miscellaneous:
There are also commercial scraping protection services, such as the anti-scraping service by Cloudflare or Distil Networks (details on how it works here), which do these things, and more, for you.
Find a balance between usability for real users and scraper-proofness: Everything you do will impact user experience negatively in one way or another, find compromises.
Don't forget your mobile site and apps. If you have a mobile app, that can be screenscraped too, and network traffic can be inspected to determine the REST endpoints it uses.
Scrapers can scrape other scrapers: If there's one website which has content scraped from yours, other scrapers can scrape from that scraper's website.
As others have mentioned, scrapers can fake nearly every aspect of their activities, and it is probably very difficult to identify the requests that are coming from the bad guys.
I would consider:
Set up a page, /jail.html.
Disallow access to the page in robots.txt (so the respectful spiders will never visit).
Place a link on one of your pages, hiding it with CSS (display: none).
Record IP addresses of visitors to /jail.html.
This might help you to quickly identify requests from scrapers that are flagrantly disregarding your robots.txt.
You might also want to make your /jail.html a whole entire website that has the same, exact markup as normal pages, but with fake data (/jail/album/63ajdka, /jail/track/3aads8, etc.). This way, the bad scrapers won't be alerted to "unusual input" until you have the chance to block them entirely.
Sue 'em.
Seriously: If you have some money, talk to a good, nice, young lawyer who knows their way around the Internets. You could really be able to do something here. Depending on where the sites are based, you could have a lawyer write up a cease & desist or its equivalent in your country. You may be able to at least scare the bastards.
Document the insertion of your dummy values. Insert dummy values that clearly (but obscurely) point to you. I think this is common practice with phone book companies, and here in Germany, I think there have been several instances when copycats got busted through fake entries they copied 1:1.
It would be a shame if this would drive you into messing up your HTML code, dragging down SEO, validity and other things (even though a templating system that uses a slightly different HTML structure on each request for identical pages might already help a lot against scrapers that always rely on HTML structures and class/ID names to get the content out.)
Cases like this are what copyright laws are good for. Ripping off other people's honest work to make money with is something that you should be able to fight against.
Provide an XML API to access your data; in a manner that is simple to use. If people want your data, they'll get it, you might as well go all out.
This way you can provide a subset of functionality in an effective manner, ensuring that, at the very least, the scrapers won't guzzle up HTTP requests and massive amounts of bandwidth.
Then all you have to do is convince the people who want your data to use the API. ;)
There is really nothing you can do to completely prevent this. Scrapers can fake their user agent, use multiple IP addresses, etc. and appear as a normal user. The only thing you can do is make the text not available at the time the page is loaded - make it with image, flash, or load it with JavaScript. However, the first two are bad ideas, and the last one would be an accessibility issue if JavaScript is not enabled for some of your regular users.
If they are absolutely slamming your site and rifling through all of your pages, you could do some kind of rate limiting.
There is some hope though. Scrapers rely on your site's data being in a consistent format. If you could randomize it somehow it could break their scraper. Things like changing the ID or class names of page elements on each load, etc. But that is a lot of work to do and I'm not sure if it's worth it. And even then, they could probably get around it with enough dedication.
Okay, as all posts say, if you want to make it search engine-friendly then bots can scrape for sure.
But you can still do a few things, and it may be effective against 60-70% of scraping bots.
Make a checker script like below.
If a particular IP address is visiting very fast then after a few visits (5-10) put its IP address + browser information in a file or database.
The next step
(This would be a background process, running all the time or scheduled every few minutes.) Make another script that will keep on checking those suspicious IP addresses.
Case 1. If the user agent is that of a known search engine like Google, Bing or Yahoo (you can find more information on user agents by googling it), then you should check http://www.iplists.com/, and try to match the IP against those lists. If it seems like a faked user agent, ask the visitor to fill in a CAPTCHA on the next visit. (You need to research bot IP addresses a bit more. I know this is achievable; also try a whois of the IP address. It can be helpful.)
Case 2. No user agent of a search bot: Simply ask the visitor to fill in a CAPTCHA on the next visit.
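A sketch of the "Case 1" check using forward-confirmed reverse DNS, which is the verification method Google documents for Googlebot; /captcha-challenge is a hypothetical page:
<?php
// Sketch: verify that a request claiming to be Googlebot really comes from Google.
function isRealGooglebot(string $ip): bool {
    $host = gethostbyaddr($ip);                          // reverse DNS lookup
    if ($host === false || !preg_match('/\.(googlebot|google)\.com$/', $host)) {
        return false;
    }
    return gethostbyname($host) === $ip;                 // forward lookup must match
}

$ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
$ip = $_SERVER['REMOTE_ADDR'];

if (stripos($ua, 'Googlebot') !== false && !isRealGooglebot($ip)) {
    header('Location: /captcha-challenge');              // claimed to be a bot, failed verification
    exit;
}
?>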
Late answer - and also this answer probably isn't the one you want to hear...
I myself have already written many (many tens of) different specialized data-mining scrapers (just because I like the "open data" philosophy).
There is already a lot of advice in the other answers - now I will play the devil's advocate role and extend and/or correct their effectiveness.
First:
if someone really wants your data
you can't effectively (technically) hide your data
if the data should be publicly accessible to your "regular users"
Trying to use technical barriers isn't worth the trouble they cause:
to your regular users, by worsening their user experience
to regular and welcome bots (search engines)
etc...
Plain HTML - the easiest way is to parse plain HTML pages with a well-defined structure and CSS classes. E.g. it is enough to inspect the element with Firebug and use the right XPaths and/or CSS paths in my scraper.
You could generate the HTML structure dynamically, and you could also generate the CSS class names dynamically (and the CSS itself too, e.g. by using some random class names) - but
you want to present the information to your regular users in a consistent way
e.g. again - it is enough to analyze the page structure once more to set up the scraper.
and it can be done automatically by analyzing some "already known content"
once someone already knows (from an earlier scrape), e.g.:
which page contains the information about "phil collins"
it is enough to display the "phil collins" page and (automatically) analyze how the page is structured "today" :)
You can't change the structure for every response, because your regular users will hate you. Also, this will cause more trouble for you (maintenance) than for the scraper. The XPath or CSS path can be determined automatically by the scraping script from the known content.
Ajax - a little bit harder at the start, but many times it speeds up the scraping process :) - why?
When analyzing the requests and responses, I just set up my own proxy server (written in Perl) and have my Firefox use it. Of course, because it is my own proxy, it is completely hidden - the target server sees it as a regular browser. (So, no X-Forwarded-For and such headers.)
Based on the proxy logs, it is mostly possible to determine the "logic" of the Ajax requests, e.g. I could skip most of the HTML scraping and just use the well-structured Ajax responses (mostly in JSON format).
So, Ajax doesn't help much...
More complicated are pages which use heavily packed JavaScript functions.
Here it is possible to use two basic methods:
unpack and understand the JS, and create a scraper which follows the JavaScript logic (the hard way)
or (the one I prefer myself) - just use Mozilla with MozRepl for scraping. E.g. the real scraping is done in a full-featured, JavaScript-enabled browser, which is programmed to click the right elements and just grab the "decoded" responses directly from the browser window.
Such scraping is slow (the scraping is done as in a regular browser), but it is
very easy to set up and use
and it is nearly impossible to counter it :)
and the "slowness" is needed anyway to counter the blocking of rapid same-IP-based requests
User-Agent-based filtering doesn't help at all. Any serious data-miner will set it to a correct one in his scraper.
Requiring login doesn't help. The simplest way to beat it (without any analysis and/or scripting of the login protocol) is just logging into the site as a regular user, using Mozilla, and then just running the MozRepl-based scraper...
Remember, requiring login helps against anonymous bots, but doesn't help against someone who wants to scrape your data. He just registers himself on your site as a regular user.
Using frames isn't very effective either. This is used by many live movie services, and it is not very hard to beat. The frames are simply more HTML/JavaScript pages that need to be analyzed... If the data is worth the trouble, the data-miner will do the required analysis.
IP-based limiting isn't effective at all - there are too many public proxy servers, and there is also TOR... :) It doesn't slow down the scraping (for someone who really wants your data).
It is very hard to scrape data hidden in images (e.g. by simply converting the data into images server-side). Employing "tesseract" (OCR) helps many times - but honestly, the data must be worth the trouble for the scraper (which it often isn't).
On the other side, your users will hate you for this. I myself (even when not scraping) hate websites which don't allow copying the page content to the clipboard (because the information is in the images, or (the silly ones) try to bind some custom JavaScript event to the right click). :)
The hardest are sites which use Java applets or Flash, where the applet itself uses secure HTTPS requests internally. But think twice - how happy will your iPhone users be... ;) Therefore, currently very few sites use them. I myself block all Flash content in my browser (in regular browsing sessions) - and never use sites which depend on Flash.
Your milestones could be..., so you can try this method - just remember, you will probably lose some of your users. Also remember, some SWF files are decompilable. ;)
Captchas (the good ones - like reCaptcha) help a lot - but your users will hate you... - just imagine how your users will love you when they need to solve captchas on every page showing information about music artists.
I probably don't need to continue - you already get the picture.
Now what you should do:
Remember: It is nearly impossible to hide your data if, on the other hand, you want to publish it (in a friendly way) to your regular users.
So,
make your data easily accessible - by some API
this allows easy data access
e.g. it offloads the scraping from your server - good for you
set up the right usage rights (e.g. requiring that the source must be cited)
remember, a lot of data isn't copyrightable - and is hard to protect
add some fake data (as you already do) and use legal tools
as others have already said, send a "cease and desist" letter
other legal action (suing and the like) is probably too costly and hard to win (especially against non-US sites)
Think twice before you try to use technical barriers.
Rather than trying to block the data-miners, just put more effort into your website's usability. Your users will love you. The time (and energy) invested in technical barriers usually isn't worth it - better to spend the time making an even better website...
Also, data-thieves aren't like normal thieves.
If you buy an inexpensive home alarm and add a warning "this house is connected to the police" - many thieves will not even try to break in, because one wrong move and they go to jail...
So, you invest only a few bucks, but the thief invests and risks much.
But the data-thief doesn't have such risks. Just the opposite - if you make one wrong move (e.g. if you introduce some bug as a result of technical barriers), you will lose your users. If the scraping bot does not work the first time, nothing happens - the data-miner will just try another approach and/or debug the script.
In this case, you need to invest much more - and the scraper invests much less.
Just think about where you want to invest your time & energy...
PS: English isn't my native language - so forgive my broken English...
I have done a lot of web scraping and summarized some techniques to stop web scrapers on my blog based on what I find annoying.
It is a tradeoff between your users and the scrapers. If you limit IPs, use CAPTCHAs, require login, etc., you make life difficult for the scrapers. But this may also drive away your genuine users.
From a tech perspective:
Just model what Google does when you hit them with too many queries at once. That should put a halt to a lot of it.
From a legal perspective:
It sounds like the data you're publishing is not proprietary. Meaning you're publishing names and stats and other information that cannot be copyrighted.
If this is the case, the scrapers are not violating copyright by redistributing your information about artist name etc. However, they may be violating copyright when they load your site into memory because your site contains elements that are copyrightable (like layout etc).
I recommend reading about Facebook v. Power.com and seeing the arguments Facebook used to stop screen scraping. There are many legal ways you can go about trying to stop someone from scraping your website. They can be far reaching and imaginative. Sometimes the courts buy the arguments. Sometimes they don't.
But, assuming you're publishing public domain information that's not copyrightable like names and basic stats... you should just let it go in the name of free speech and open data. That is, what the web's all about.
Your best option is unfortunately fairly manual: Look for traffic patterns that you believe are indicative of scraping and ban their IP addresses.
Since you're talking about a public site, making the site search-engine friendly will also make the site scraping-friendly. If a search engine can crawl and scrape your site, then a malicious scraper can as well. It's a fine line to walk.
Sure it's possible. For 100% success, take your site offline.
In reality you can do some things that make scraping a little more difficult. Google does browser checks to make sure you're not a robot scraping search results (although this, like most everything else, can be spoofed).
You can do things like require several seconds between the first connection to your site, and subsequent clicks. I'm not sure what the ideal time would be or exactly how to do it, but that's another idea.
I'm sure there are several other people who have a lot more experience, but I hope those ideas are at least somewhat helpful.
Embrace it. Why not publish as RDFa and become super search engine friendly and encourage the re-use of data? People will thank you and provide credit where due (see musicbrainz as an example).
It is not the answer you probably want, but why hide what you're trying to make public?
如果我必须这样做,我可能会结合使用后三种,因为它们可以最大限度地减少对合法用户的不便。然而,你必须接受这样的事实:你无法以这种方式阻止所有人,一旦有人找到了绕过它的方法,他们就可以永远绕过它。我想,当您发现他们的 IP 地址时,您可以尝试阻止它们。
There are a few things you can do to try and prevent screen scraping. Some are not very effective, while others (a CAPTCHA) are, but hinder usability. You have to keep in mind too that it may hinder legitimate site scrapers, such as search engine indexes.
However, I assume that if you don't want it scraped that means you don't want search engines to index it either.
Here are some things you can try:
Show the text in an image. This is quite reliable, and is less of a pain on the user than a CAPTCHA, but means they won't be able to cut and paste and it won't scale prettily or be accessible.
Use a CAPTCHA and require it to be completed before returning the page. This is a reliable method, but also the biggest pain to impose on a user.
Require the user to sign up for an account before viewing the pages, and confirm their email address. This will be pretty effective, but not totally - a screen-scraper might set up an account and might cleverly program their script to log in for them.
If the client's user-agent string is empty, block access. A site-scraping script will often be lazily programmed and won't set a user-agent string, whereas all web browsers will.
You can set up a black list of known screen scraper user-agent strings as you discover them. Again, this will only help the lazily-coded ones; a programmer who knows what he's doing can set a user-agent string to impersonate a web browser.
Change the URL path often. When you change it, make sure the old one keeps working, but only for as long as one user is likely to have their browser open. Make it hard to predict what the new URL path will be. This will make it difficult for scripts to grab it if their URL is hard-coded. It'd be best to do this with some kind of script.
If I had to do this, I'd probably use a combination of the last three, because they minimise the inconvenience to legitimate users. However, you'd have to accept that you won't be able to block everyone this way and once someone figures out how to get around it, they'll be able to scrape it forever. You could then just try to block their IP addresses as you discover them I guess.
Method One (Small Sites Only):
Serve encrypted / encoded data. I scrape the web using Python (urllib, requests, BeautifulSoup, etc...) and found many websites that serve encrypted / encoded data that cannot simply be decoded in any programming language, because the encryption/encoding method is not known.
I achieved this on a PHP website by encrypting and minimizing the output (WARNING: this is not a good idea for large sites); the response was always jumbled content.
<?php
function sanitize_output($buffer) {
$search = array(
'/\>[^\S ]+/s', // strip whitespaces after tags, except space
'/[^\S ]+\</s', // strip whitespaces before tags, except space
'/(\s)+/s' // shorten multiple whitespace sequences
);
$replace = array('>', '<', '\\1');
$buffer = preg_replace($search, $replace, $buffer);
return $buffer;
}
ob_start("sanitize_output");
?>
Method Two:
If you can't stop them, screw them over: serve fake / useless data as a response.
Method Three:
Block common scraping user agents; you'll see this on major / large websites, as it is impossible to scrape them with "python3.4" as your User-Agent.
Method Four:
Make sure all the user headers are valid. I sometimes provide as many headers as possible to make my scraper seem like an authentic user; some of them are not even true or valid, like en-FU :).
Here is a list of some of the headers I commonly provide.
Rather than blacklisting bots, maybe you should whitelist them. If you don't want to kill your search results for the top few engines, you can whitelist their user-agent strings, which are generally well-publicized. The less ethical bots tend to forge user-agent strings of popular web browsers. The top few search engines should be driving upwards of 95% of your traffic.
Identifying the bots themselves should be fairly straightforward, using the techniques other posters have suggested.
A quick approach to this would be to set a booby/bot trap.
Make a page that, if it's opened a certain number of times or even opened at all, will collect certain information like the IP and whatnot (you can also consider irregularities or patterns, but this page shouldn't have to be opened at all).
Make a link to this in your page that is hidden with CSS display:none; or left:-9999px; position:absolute; and try to place it in places that are less likely to be ignored, like where your content falls under, rather than in your footer, as bots can sometimes choose to forget about certain parts of a page.
In your robots.txt file set a whole bunch of disallow rules to pages you don't want friendly bots (LOL, like they have happy faces!) to gather information on and set this page as one of them.
Now, if a friendly bot comes through, it should ignore that page. Right, but that still isn't good enough. Make a couple more of these pages, or somehow re-route a page to accept different names. Then place more disallow rules for these trap pages in your robots.txt file, alongside the pages you want ignored.
Collect the IP of these bots, or of anyone that enters these pages. Don't ban them, but make a function to display noodled text in your content, like random numbers, copyright notices, specific text strings, or scary pictures - basically anything to hinder your good content. You can also set links that point to a page which will take forever to load, i.e. in PHP you can use the sleep() function (see the sketch below). This will fight the crawler back if it has some sort of detection to bypass pages that take way too long to load, as some well-written bots are set to process X amount of links at a time.
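A minimal sketch of such a deliberately slow "tarpit" page (the durations are arbitrary; link to it only from the hidden trap links, never from real content):
<?php
// Sketch: dribble the response out very slowly to tie up scrapers that process
// a fixed number of links at a time.
set_time_limit(0);              // don't let PHP kill the script after the default limit

for ($i = 0; $i < 300; $i++) {
    echo str_repeat('.', 64) . "\n";
    flush();                    // push the trickle of output to the client
    sleep(1);                   // roughly five minutes in total
}
?>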
If you have made specific text strings/sentences why not go to your favorite search engine and search for them, it might show you where your content is ending up.
Anyway, if you think tactically and creatively this could be a good starting point. The best thing to do would be to learn how a bot works.
I'd also think about scrambling some IDs, or the way attributes on the page elements are displayed.
You can't stop normal screen scraping. For better or worse, it's the nature of the web.
You can make it so no one can access certain things (including music files) unless they're logged in as a registered user. It's not too difficult to do in Apache. I assume it wouldn't be too difficult to do in IIS as well.
Most of this has been said already, but have you considered the CloudFlare protection?
Other companies probably do this too, CloudFlare is the only one I know.
I'm pretty sure that would complicate their work. I also once got an IP banned automatically for 4 months when I tried to scrape data from a site protected by CloudFlare, due to rate limiting (I used a simple AJAX request loop).
One way would be to serve the content as XML attributes, URL encoded strings, preformatted text with HTML encoded JSON, or data URIs, then transform it to HTML on the client. Here are a few sites which do this:
I agree with most of the posts above, and I'd like to add that the more search engine friendly your site is, the more scrapeable it will be. You could try to do a couple of things that are very out there and make it harder for scrapers, but it might also affect your searchability... It depends on how well you want your site to rank on search engines, of course.
Putting your content behind a captcha would mean that robots would find it difficult to access your content. However, humans would be inconvenienced so that may be undesirable.
If you want to see a great example, check out http://www.bkstr.com/. They use a j/s algorithm to set a cookie, then reloads the page so it can use the cookie to validate that the request is being run within a browser. A desktop app built to scrape could definitely get by this, but it would stop most cURL type scraping.
Screen scrapers work by processing HTML. And if they are determined to get your data there is not much you can do technically because the human eyeball processes anything. Legally it's already been pointed out you may have some recourse though and that would be my recommendation.
However, you can hide the critical parts of your data by using non-HTML-based presentation logic:
Generate an image for each artist's content. Maybe just an image for the artist name, etc. would be enough. Do this by rendering the text onto a JPEG/PNG file on the server and linking to that image.
Bear in mind that this would probably affect your search rankings.
Generate the HTML, CSS and JavaScript. It is easier to write generators than parsers, so you could generate each served page differently. You can no longer use a cache or static content then.
Note: Since the complete version of this answer exceeds Stack Overflow's length limit, you'll need to head to GitHub to read the extended version, with more tips and details.
In order to hinder scraping (also known as Webscraping, Screenscraping, Web data mining, Web harvesting, or Web data extraction), it helps to know how these scrapers work, and, by extension, what prevents them from working well.
There are various types of scrapers, and each works differently:
Spiders, such as Google's bot or website copiers like HTTrack, which recursively follow links to other pages in order to get data. These are sometimes used for targeted scraping to get specific data, often in combination with an HTML parser to extract the desired data from each page.
Shell scripts: Sometimes, common Unix tools are used for scraping: Wget or Curl to download pages, and Grep (Regex) to extract the data.
HTML parsers, such as ones based on Jsoup, Scrapy, and others. Similar to shell-script regex based ones, these work by extracting data from pages based on patterns in HTML, usually ignoring everything else.
For example: If your website has a search feature, such a scraper might submit a request for a search, and then get all the result links and their titles from the results page HTML, in order to specifically get only search result links and their titles. These are the most common.
Screenscrapers, based on eg. Selenium or PhantomJS, which open your website in a real browser, run JavaScript, AJAX, and so on, and then get the desired text from the webpage, usually by:
Getting the HTML from the browser after your page has been loaded and JavaScript has run, and then using a HTML parser to extract the desired data. These are the most common, and so many of the methods for breaking HTML parsers / scrapers also work here.
Taking a screenshot of the rendered pages, and then using OCR to extract the desired text from the screenshot. These are rare, and only dedicated scrapers who really want your data will set this up.
Webscraping services such as ScrapingHub or Kimono. In fact, there are people whose job it is to figure out how to scrape your site and pull out the content for others to use.
Unsurprisingly, professional scraping services are the hardest to deter, but if you make it hard and time-consuming to figure out how to scrape your site, these (and people who pay them to do so) may not be bothered to scrape your website.
Embedding your website in other site's pages with frames, and embedding your site in mobile apps.
While not technically scraping, mobile apps (Android and iOS) can embed websites, and inject custom CSS and JavaScript, thus completely changing the appearance of your pages.
Human copy - paste: People will copy and paste your content in order to use it elsewhere.
There is a lot of overlap between these different kinds of scrapers, and many scrapers will behave similarly, even if they use different technologies and methods.
These tips are mostly my own ideas, various difficulties that I've encountered while writing scrapers, as well as bits of information and ideas from around the interwebs.
How to stop scraping
You can't completely prevent it, since whatever you do, determined scrapers can still figure out how to scrape. However, you can stop a lot of scraping by doing a few things:
Monitor your logs & traffic patterns; limit access if you see unusual activity:
Check your logs regularly, and in case of unusual activity indicative of automated access (scrapers), such as many similar actions from the same IP address, you can block or limit access.
Specifically, some ideas:
Rate limiting:
Only allow users (and scrapers) to perform a limited number of actions in a certain time - for example, only allow a few searches per second from any specific IP address or user. This will slow down scrapers, and make them ineffective. You could also show a captcha if actions are completed too fast or faster than a real user would.
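A minimal sketch of this kind of per-IP rate limiting, assuming a Flask app and an in-memory counter; the route name, thresholds and captcha redirect are illustrative, not taken from the answer:

import time
from collections import defaultdict, deque

from flask import Flask, request, redirect

app = Flask(__name__)

WINDOW_SECONDS = 1      # look at the last second of traffic
MAX_SEARCHES = 3        # allow only a few searches per second per IP (illustrative)
recent_searches = defaultdict(deque)  # ip -> timestamps of recent search requests

@app.before_request
def rate_limit_searches():
    if request.path != "/search":
        return None
    ip = request.remote_addr
    now = time.time()
    window = recent_searches[ip]
    # drop timestamps that have fallen out of the window
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    window.append(now)
    if len(window) > MAX_SEARCHES:
        # too fast for a human: slow them down with a captcha instead of a hard block
        return redirect("/captcha")
    return None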
Detect unusual activity:
If you see unusual activity, such as many similar requests from a specific IP address, someone looking at an excessive number of pages or performing an unusual number of searches, you can prevent access, or show a captcha for subsequent requests.
Don't just monitor & rate limit by IP address - use other indicators too:
If you do block or rate limit, don't just do it on a per-IP address basis; you can use other indicators and methods to identify specific users or scrapers. Some indicators which can help you identify specific users / scrapers include:
How fast users fill out forms, and where on a button they click;
You can gather a lot of information with JavaScript, such as screen size / resolution, timezone, installed fonts, etc; you can use this to identify users.
HTTP headers and their order, especially User-Agent.
As an example, if you get many requests from a single IP address, all using the same User Agent, screen size (determined with JavaScript), and the user (scraper in this case) always clicks on the button in the same way and at regular intervals, it's probably a screen scraper; and you can temporarily block similar requests (eg. block all requests with that user agent and screen size coming from that particular IP address), and this way you won't inconvenience real users on that IP address, eg. in case of a shared internet connection.
You can also take this further, as you can identify similar requests, even if they come from different IP addresses, indicative of distributed scraping (a scraper using a botnet or a network of proxies). If you get a lot of otherwise identical requests, but they come from different IP addresses, you can block. Again, be aware of not inadvertently blocking real users.
This can be effective against screenscrapers which run JavaScript, as you can get a lot of information from them.
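One way to sketch such a cross-IP fingerprint (the exact signals and how you weight them are assumptions, not from the answer) is to hash a few request properties together and count actions per fingerprint rather than per IP:

import hashlib

def request_fingerprint(headers, screen_size):
    """Combine signals that stay stable across a botnet's different IP addresses."""
    parts = [
        headers.get("User-Agent", ""),
        headers.get("Accept-Language", ""),
        headers.get("Accept", ""),
        screen_size or "",   # e.g. as reported back by your own JavaScript
    ]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()

# Many requests from different IPs that share one fingerprint and arrive at
# machine-like regular intervals hint at distributed scraping; rate limit or
# captcha by fingerprint in addition to (or instead of) by IP address.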
Related questions on Security Stack Exchange:
How to uniquely identify users with the same external IP address? for more details, and
Why do people use IP address bans when IP addresses often change? for info on the limits of these methods.
Instead of temporarily blocking access, use a Captcha:
The simple way to implement rate-limiting would be to temporarily block access for a certain amount of time, however using a Captcha may be better, see the section on Captchas further down.
Require registration & login
Require account creation in order to view your content, if this is feasible for your site. This is a good deterrent for scrapers, but is also a good deterrent for real users.
In order to avoid scripts creating many accounts, you should:
Require an email address for registration, and verify that email address by sending a link that must be opened in order to activate the account. Allow only one account per email address.
Require a captcha to be solved during registration / account creation.
Requiring account creation to view content will drive users and search engines away; if you require account creation in order to view an article, users will go elsewhere.
Block access from cloud hosting and scraping service IP addresses
Sometimes, scrapers will be run from web hosting services, such as Amazon Web Services or GAE, or VPSes. Limit access to your website (or show a captcha) for requests originating from the IP addresses used by such cloud hosting services.
Similarly, you can also limit access from IP addresses used by proxy or VPN providers, as scrapers may use such proxy servers to avoid many requests being detected.
Beware that by blocking access from proxy servers and VPNs, you will negatively affect real users.
Make your error message nondescript if you do block
If you do block / limit access, you should ensure that you don't tell the scraper what caused the block, thereby giving them clues as to how to fix their scraper. So a bad idea would be to show error pages with text like:
Too many requests from your IP address, please try again later.
Error, User Agent header not present !
Instead, show a friendly error message that doesn't tell the scraper what caused it. Something like this is much better:
You can contact support via [email protected], should the problem persist.
This is also a lot more user friendly for real users, should they ever see such an error page. You should also consider showing a captcha for subsequent requests instead of a hard block, in case a real user sees the error message, so that you don't block and thus cause legitimate users to contact you.
Use Captchas if you suspect that your website is being accessed by a scraper.
Captchas ("Completely Automated Test to Tell Computers and Humans apart") are very effective against stopping scrapers. Unfortunately, they are also very effective at irritating users.
As such, they are useful when you suspect a possible scraper, and want to stop the scraping, without also blocking access in case it isn't a scraper but a real user. You might want to consider showing a captcha before allowing access to the content if you suspect a scraper.
Things to be aware of when using Captchas:
Don't roll your own, use something like Google's reCaptcha : It's a lot easier than implementing a captcha yourself, it's more user-friendly than some blurry and warped text solution you might come up with yourself (users often only need to tick a box), and it's also a lot harder for a scripter to solve than a simple image served from your site
Don't include the solution to the captcha in the HTML markup: I've actually seen one website which had the solution for the captcha in the page itself, (although quite well hidden) thus making it pretty useless. Don't do something like this. Again, use a service like reCaptcha, and you won't have this kind of problem (if you use it properly).
Captchas can be solved in bulk: There are captcha-solving services where actual, low-paid, humans solve captchas in bulk. Again, using reCaptcha is a good idea here, as they have protections (such as the relatively short time the user has in order to solve the captcha). This kind of service is unlikely to be used unless your data is really valuable.
Serve your text content as an image
You can render text into an image server-side, and serve that to be displayed, which will hinder simple scrapers extracting text.
However, this is bad for screen readers, search engines, performance, and pretty much everything else. It's also illegal in some places (due to accessibility, eg. the Americans with Disabilities Act), and it's also easy to circumvent with some OCR, so don't do it.
You can do something similar with CSS sprites, but that suffers from the same problems.
Don't expose your complete dataset:
If feasible, don't provide a way for a script / bot to get all of your dataset. As an example: You have a news site, with lots of individual articles. You could make those articles be only accessible by searching for them via the on site search, and, if you don't have a list of all the articles on the site and their URLs anywhere, those articles will be only accessible by using the search feature. This means that a script wanting to get all the articles off your site will have to do searches for all possible phrases which may appear in your articles in order to find them all, which will be time-consuming, horribly inefficient, and will hopefully make the scraper give up.
This will be ineffective if your articles are served from a URL which looks something like example.com/article.php?articleId=12345. This (and similar things) will allow scrapers to simply iterate over all the articleIds and request all the articles that way.
Don't expose your APIs, endpoints, and similar things:
Make sure you don't expose any APIs, even unintentionally. For example, if you are using AJAX or network requests from within Adobe Flash or Java Applets (God forbid!) to load your data it is trivial to look at the network requests from the page and figure out where those requests are going to, and then reverse engineer and use those endpoints in a scraper program. Make sure you obfuscate your endpoints and make them hard for others to use, as described.
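A small sketch of the "don't make the dataset enumerable" idea from the two tips above, assuming you control your own URL scheme; the helper and routing comment below are hypothetical, not part of any framework:

import secrets

def new_article_slug():
    """Generate an opaque, non-sequential identifier such as 'k3q9v2w8c1...'.
    Unlike articleId=12345, these cannot simply be iterated over."""
    return secrets.token_urlsafe(12)

# e.g. store the slug with the article and serve it at /article/<slug>,
# instead of exposing /article.php?articleId=<sequential number>.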
To deter HTML parsers and scrapers:
Since HTML parsers work by extracting content from pages based on identifiable patterns in the HTML, we can intentionally change those patterns in order to break these scrapers, or even screw with them. Most of these tips also apply to other scrapers like spiders and screenscrapers too.
Frequently change your HTML
Scrapers which process HTML directly do so by extracting contents from specific, identifiable parts of your HTML page. For example: If all pages on your website have a div with an id of article-content, which contains the text of the article, then it is trivial to write a script to visit all the article pages on your site, and extract the content text of the article-content div on each article page, and voilà, the scraper has all the articles from your site in a format that can be reused elsewhere.
If you change the HTML and the structure of your pages frequently, such scrapers will no longer work.
You can frequently change the ids and classes of elements in your HTML, perhaps even automatically. So, if your div.article-content becomes something like div.a4c36dda13eaf0, and changes every week, the scraper will work fine initially, but will break after a week. Make sure to change the length of your ids / classes too, otherwise the scraper will use div.[any-14-characters] to find the desired div instead. Beware of other similar holes too.
If there is no way to find the desired content from the markup, the scraper will do so from the way the HTML is structured. So, if all your article pages are similar in that every div inside a div which comes after an h1 is the article content, scrapers will get the article content based on that. Again, to break this, you can add / remove extra markup to your HTML, periodically and randomly, eg. adding extra divs or spans. With modern server side HTML processing, this should not be too hard. (One way to automate this rotation is sketched a little further below.)
Things to be aware of:
It will be tedious and difficult to implement, maintain, and debug.
You will hinder caching. Especially if you change ids or classes of your HTML elements, this will require corresponding changes in your CSS and JavaScript files, which means that every time you change them, they will have to be re-downloaded by the browser. This will result in longer page load times for repeat visitors, and increased server load. If you only change it once a week, it will not be a big problem.
Clever scrapers will still be able to get your content by inferring where the actual content is, eg. by knowing that a large single block of text on the page is likely to be the actual article. This makes it possible to still find & extract the desired data from the page. Boilerpipe does exactly this.
Essentially, make sure that it is not easy for a script to find the actual, desired content for every similar page.
See also How to prevent crawlers depending on XPath from getting page contents for details on how this can be implemented in PHP.
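As promised above, here is one way the id/class rotation could be automated, assuming server-side templating where you control both the HTML and the generated CSS; the weekly HMAC scheme, the secret and the names are all illustrative:

import hashlib
import hmac
import time

SECRET = b"rotate-me"  # hypothetical secret, change as needed

def rotated_class(base_name, period_seconds=7 * 24 * 3600):
    """Map a stable internal name like 'article-content' to a class name
    that changes every week and has an unpredictable length."""
    bucket = str(int(time.time() // period_seconds)).encode()
    digest = hmac.new(SECRET, bucket + base_name.encode(), hashlib.sha256).hexdigest()
    length = 8 + int(digest[:2], 16) % 8   # vary the length too (8-15 chars)
    return "c" + digest[:length]

# Use rotated_class("article-content") both when rendering the HTML templates and
# when generating the CSS, so real browsers keep working while a scraper that is
# hard-coded to div.article-content breaks after the next rotation.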
Change your HTML based on the user's location
This is sort of similar to the previous tip. If you serve different HTML based on your user's location / country (determined by IP address), this may break scrapers which are delivered to users. For example, if someone is writing a mobile app which scrapes data from your site, it will work fine initially, but break when it's actually distributed to users, as those users may be in a different country, and thus get different HTML, which the embedded scraper was not designed to consume.
Frequently change your HTML, actively screw with the scrapers by doing so!
An example: You have a search feature on your website, located at example.com/search?query=somesearchquery, which returns the following HTML:
As you may have guessed this is easy to scrape: all a scraper needs to do is hit the search URL with a query, and extract the desired data from the returned HTML. In addition to periodically changing the HTML as described above, you could also leave the old markup with the old ids and classes in, hide it with CSS, and fill it with fake data, thereby poisoning the scraper. Here's how the search results page could be changed:
This will mean that scrapers written to extract data from the HTML based on classes or IDs will continue to seemingly work, but they will get fake data or even ads, data which real users will never see, as they're hidden with CSS.
Screw with the scraper: Insert fake, invisible honeypot data into your page
Adding on to the previous example, you can add invisible honeypot items to your HTML to catch scrapers. An example which could be added to the previously described search results page:
A scraper written to get all the search results will pick this up, just like any of the other, real search results on the page, and visit the link, looking for the desired content. A real human will never even see it in the first place (due to it being hidden with CSS), and won't visit the link. A genuine and desirable spider such as Google's will not visit the link either, because you disallowed /scrapertrap/ in your robots.txt.
You can make your scrapertrap.php do something like block access for the IP address that visited it, or force a captcha for all subsequent requests from that IP.
Don't forget to disallow your honeypot (/scrapertrap/) in your robots.txt file so that search engine bots don't fall into it.
You can / should combine this with the previous tip of changing your HTML frequently.
Change this frequently too, as scrapers will eventually learn to avoid it. Change the honeypot URL and text. Also consider changing the inline CSS used for hiding, and use an ID attribute and external CSS instead, as scrapers will learn to avoid anything which has a style attribute with CSS used to hide the content. Also try only enabling it sometimes, so the scraper works initially, but breaks after a while. This also applies to the previous tip.
Malicious people can prevent access for real users by sharing a link to your honeypot, or even embedding that link somewhere as an image (eg. on a forum). Change the URL frequently, and make any ban times relatively short.
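A minimal sketch of what the honeypot endpoint could do, using Flask purely for illustration; scrapertrap.php from the text is re-imagined here as a Python route, and the ban length and in-memory storage are assumptions:

import time

from flask import Flask, abort, request

app = Flask(__name__)
banned_until = {}        # ip -> unix timestamp when the ban expires
BAN_SECONDS = 3600       # keep ban times relatively short, as suggested above

@app.route("/scrapertrap/scrapertrap")
def scraper_trap():
    # only clients that ignore robots.txt and follow the CSS-hidden link end up here
    banned_until[request.remote_addr] = time.time() + BAN_SECONDS
    return "", 204

@app.before_request
def enforce_ban():
    expiry = banned_until.get(request.remote_addr)
    if expiry and time.time() < expiry:
        abort(403)  # or redirect to a captcha page instead of a hard block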
Serve fake and useless data if you detect a scraper
If you detect what is obviously a scraper, you can serve up fake and useless data; this will corrupt the data the scraper gets from your website. You should also make it impossible to distinguish such fake data from real data, so that scrapers don't know that they're being screwed with.
As an example: you have a news website; if you detect a scraper, instead of blocking access, serve up fake, randomly generated articles, and this will poison the data the scraper gets. If you make your fake data indistinguishable from the real thing, you'll make it hard for scrapers to get what they want, namely the actual, real data.
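A sketch of that idea, assuming you already have some detection in place; is_probable_scraper() and load_article() are hypothetical helpers standing in for your own detection and storage:

import random

from flask import Flask, request

app = Flask(__name__)

FAKE_HEADLINES = ["Local council approves new bridge", "Startup raises funding round"]

def is_probable_scraper(req) -> bool:
    """Hypothetical: combine the signals discussed above (rate, fingerprint, UA, ...)."""
    return False

def load_article(slug):
    """Hypothetical lookup of the real article."""
    return {"title": "Real title for %s" % slug, "body": "Real article text."}

def fake_article():
    """Randomly generated filler that should be indistinguishable from the real thing."""
    return {"title": random.choice(FAKE_HEADLINES), "body": "Plausible but made-up text."}

@app.route("/article/<slug>")
def article(slug):
    data = fake_article() if is_probable_scraper(request) else load_article(slug)
    return "<h1>{title}</h1><p>{body}</p>".format(**data)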
Don't accept requests if the User Agent is empty / missing
Often, lazily written scrapers will not send a User Agent header with their request, whereas all browsers as well as search engine spiders will.
If you get a request where the User Agent header is not present, you can show a captcha, or simply block or limit access. (Or serve fake data as described above, or something else..)
It's trivial to spoof, but as a measure against poorly written scrapers it is worth implementing.
Don't accept requests if the User Agent is a common scraper one; blacklist ones used by scrapers
In some cases, scrapers will use a User Agent which no real browser or search engine spider uses, such as:
If you find that a specific User Agent string is used by scrapers on your site, and it is not used by real browsers or legitimate spiders, you can also add it to your blacklist.
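A combined sketch of both User-Agent checks (missing and blacklisted), again assuming Flask; the blacklist entries are only illustrative, since the answer's own example list is not reproduced here - build yours from your logs:

from flask import Flask, abort, request

app = Flask(__name__)

# Illustrative examples only; populate this from what you actually see in your logs.
UA_BLACKLIST = ("curl", "wget", "python-requests", "scrapy", "libwww-perl")

@app.before_request
def check_user_agent():
    ua = request.headers.get("User-Agent", "").strip()
    if not ua:
        abort(403)  # or show a captcha, or serve fake data instead
    if any(bad in ua.lower() for bad in UA_BLACKLIST):
        abort(403)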
If it doesn't request assets (CSS, images), it's not a real browser.
A real browser will (almost always) request and download assets such as images and CSS. HTML parsers and scrapers won't as they are only interested in the actual pages and their content.
You could log requests to your assets, and if you see lots of requests for only the HTML, it may be a scraper.
Beware that search engine bots, ancient mobile devices, screen readers and misconfigured devices may not request assets either.
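A rough sketch of the "HTML but no assets" heuristic; the in-memory counters, extension list and the idea of flagging rather than blocking are all assumptions for illustration:

from collections import defaultdict

from flask import Flask, request

app = Flask(__name__)
page_hits = defaultdict(int)   # ip -> HTML page requests
asset_hits = defaultdict(int)  # ip -> CSS/JS/image requests

ASSET_EXTENSIONS = (".css", ".js", ".png", ".jpg", ".gif", ".svg", ".woff2")

@app.after_request
def count_request_types(response):
    ip = request.remote_addr
    if request.path.endswith(ASSET_EXTENSIONS):
        asset_hits[ip] += 1
    else:
        page_hits[ip] += 1
    # e.g. flag clients with many page hits and zero asset hits for review;
    # remember that some legitimate bots and devices also skip assets.
    return response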
Use and require cookies; use them to track user and scraper actions.
You can require cookies to be enabled in order to view your website. This will deter inexperienced and newbie scraper writers, however it is easy for a scraper to send cookies. If you do use and require them, you can track user and scraper actions with them, and thus implement rate-limiting, blocking, or showing captchas on a per-user instead of a per-IP basis.
For example: when the user performs search, set a unique identifying cookie. When the results pages are viewed, verify that cookie. If the user opens all the search results (you can tell from the cookie), then it's probably a scraper.
Using cookies may be ineffective, as scrapers can send the cookies with their requests too, and discard them as needed. You will also prevent access for real users who have cookies disabled, if your site only works with cookies.
Note that if you use JavaScript to set and retrieve the cookie, you'll block scrapers which don't run JavaScript, since they can't retrieve and send the cookie with their request.
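A sketch of the search-cookie idea described above, assuming Flask; the cookie name and the "opened nearly every result" threshold are made up for illustration:

import uuid

from flask import Flask, make_response, request

app = Flask(__name__)
results_viewed = {}  # search session id -> number of result pages opened

@app.route("/search")
def search():
    resp = make_response("...search results page...")
    resp.set_cookie("search_session", uuid.uuid4().hex, httponly=True)
    return resp

@app.route("/result/<int:result_id>")
def result(result_id):
    session_id = request.cookies.get("search_session")
    if session_id:
        results_viewed[session_id] = results_viewed.get(session_id, 0) + 1
        if results_viewed[session_id] > 50:  # a human rarely opens every single result
            return "Please complete a captcha first.", 429
    return "...the requested result..."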
Use JavaScript + Ajax to load your content
You could use JavaScript + AJAX to load your content after the page itself loads. This will make the content inaccessible to HTML parsers which do not run JavaScript. This is often an effective deterrent to newbie and inexperienced programmers writing scrapers.
Be aware of:
Using JavaScript to load the actual content will degrade user experience and performance
Search engines may not run JavaScript either, thus preventing them from indexing your content. This may not be a problem for search results pages, but may be for other things, such as article pages.
Obfuscate your markup, network requests from scripts, and everything else.
If you use Ajax and JavaScript to load your data, obfuscate the data which is transferred. As an example, you could encode your data on the server (with something as simple as base64 or more complex), and then decode and display it on the client, after fetching via Ajax. This will mean that someone inspecting network traffic will not immediately see how your page works and loads data, and it will be tougher for someone to directly request data from your endpoints, as they will have to reverse-engineer your descrambling algorithm.
If you do use Ajax for loading the data, you should make it hard to use the endpoints without loading the page first, eg by requiring some session key as a parameter, which you can embed in your JavaScript or your HTML.
You can also embed your obfuscated data directly in the initial HTML page and use JavaScript to deobfuscate and display it, which would avoid the extra network requests. Doing this will make it significantly harder to extract the data using a HTML-only parser which does not run JavaScript, as the one writing the scraper will have to reverse engineer your JavaScript (which you should obfuscate too).
You might want to change your obfuscation methods regularly, to break scrapers who have figured it out.
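One possible shape for this, sketched with Flask: the page embeds a short-lived session key, the Ajax endpoint requires it, and the payload is base64-wrapped so it is not immediately readable in the network tab. The route names and key handling are assumptions, and base64 here is obfuscation, not security:

import base64
import json
import secrets

from flask import Flask, abort, request

app = Flask(__name__)
valid_keys = set()  # a real app would expire these

@app.route("/page")
def page():
    key = secrets.token_urlsafe(16)
    valid_keys.add(key)
    # the key is embedded in the page; client-side JavaScript sends it back
    # with its Ajax request and decodes the base64 payload before rendering
    return f'<script>var SESSION_KEY = "{key}";</script><div id="content"></div>'

@app.route("/data")
def data():
    if request.args.get("key") not in valid_keys:
        abort(403)  # hard to call the endpoint without loading the page first
    payload = json.dumps({"articles": ["..."]})
    return base64.b64encode(payload.encode("utf-8"))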
There are several disadvantages to doing something like this, though:
It will be tedious and difficult to implement, maintain, and debug.
It will be ineffective against scrapers and screenscrapers which actually run JavaScript and then extract the data. (Most simple HTML parsers don't run JavaScript though)
It will make your site nonfunctional for real users if they have JavaScript disabled.
Performance and page-load times will suffer.
Non-Technical:
Tell people not to scrape, and some will respect it
Find a lawyer
Make your data available, provide an API:
You could make your data easily available and require attribution and a link back to your site. Perhaps charge $$$ for it.
Miscellaneous:
There are also commercial scraping protection services, such as the anti-scraping by Cloudflare or Distill Networks (Details on how it works here), which do these things, and more for you.
Find a balance between usability for real users and scraper-proofness: Everything you do will impact user experience negatively in one way or another, find compromises.
Don't forget your mobile site and apps. If you have a mobile app, that can be screenscraped too, and network traffic can be inspected to determine the REST endpoints it uses.
Scrapers can scrape other scrapers: If there's one website which has content scraped from yours, other scrapers can scrape from that scraper's website.
Further reading:
Wikipedia's article on Web scraping. Many details on the technologies involved and the different types of web scraper.
Stopping scripters from slamming your website hundreds of times a second. Q & A on a very similar problem - bots checking a website and buying things as soon as they go on sale. A lot of relevant info, esp. on Captchas and rate-limiting.
I will presume that you have set up robots.txt.
As others have mentioned, scrapers can fake nearly every aspect of their activities, and it is probably very difficult to identify the requests that are coming from the bad guys.
I would consider:
Setting up a page, /jail.html.
Disallowing access to the page in robots.txt (so the respectful spiders will never visit).
Placing a link to it on one of your pages, hidden with CSS (display: none).
Recording the IP addresses of visitors to /jail.html.
This might help you to quickly identify requests from scrapers that are flagrantly disregarding your robots.txt.
You might also want to make your /jail.html a whole entire website that has the same, exact markup as normal pages, but with fake data (/jail/album/63ajdka, /jail/track/3aads8, etc.). This way, the bad scrapers won't be alerted to "unusual input" until you have the chance to block them entirely.
Sue 'em.
Seriously: If you have some money, talk to a good, nice, young lawyer who knows their way around the Internets. You could really be able to do something here. Depending on where the sites are based, you could have a lawyer write up a cease & desist or its equivalent in your country. You may be able to at least scare the bastards.
Document the insertion of your dummy values. Insert dummy values that clearly (but obscurely) point to you. I think this is common practice with phone book companies, and here in Germany, I think there have been several instances when copycats got busted through fake entries they copied 1:1.
It would be a shame if this would drive you into messing up your HTML code, dragging down SEO, validity and other things (even though a templating system that uses a slightly different HTML structure on each request for identical pages might already help a lot against scrapers that always rely on HTML structures and class/ID names to get the content out.)
Cases like this are what copyright laws are good for. Ripping off other people's honest work to make money with is something that you should be able to fight against.
Provide an XML API to access your data; in a manner that is simple to use. If people want your data, they'll get it, you might as well go all out.
This way you can provide a subset of functionality in an effective manner, ensuring that, at the very least, the scrapers won't guzzle up HTTP requests and massive amounts of bandwidth.
Then all you have to do is convince the people who want your data to use the API. ;)
There is really nothing you can do to completely prevent this. Scrapers can fake their user agent, use multiple IP addresses, etc. and appear as a normal user. The only thing you can do is make the text not available at the time the page is loaded - serve it as an image or Flash, or load it with JavaScript. However, the first two are bad ideas, and the last one would be an accessibility issue if JavaScript is not enabled for some of your regular users.
If they are absolutely slamming your site and rifling through all of your pages, you could do some kind of rate limiting.
There is some hope though. Scrapers rely on your site's data being in a consistent format. If you could randomize it somehow it could break their scraper. Things like changing the ID or class names of page elements on each load, etc. But that is a lot of work to do and I'm not sure if it's worth it. And even then, they could probably get around it with enough dedication.
Sorry, it's really quite hard to do this...
I would suggest that you politely ask them to not use your content (if your content is copyrighted).
If it is and they don't take it down, then you can take further action and send them a cease and desist letter.
Generally, whatever you do to prevent scraping will probably end up with a more negative effect, e.g. accessibility, bots/spiders, etc.
Okay, as all posts say, if you want to make it search engine-friendly then bots can scrape for sure.
But you can still do a few things, and it may be effective for 60-70% of scraping bots.
Make a checker script like below.
If a particular IP address is visiting very fast then after a few visits (5-10) put its IP address + browser information in a file or database.
The next step
(This would be a background process and running all time or scheduled after a few minutes.) Make one another script that will keep on checking those suspicious IP addresses.
Case 1. If the user agent is of a known search engine like Google, Bing or Yahoo (you can find more information on user agents by googling it), then check it against http://www.iplists.com/. Use this list and try to match patterns. If it seems like a faked user agent, then ask to fill in a CAPTCHA on the next visit. (You need to research a bit more on bot IP addresses. I know this is achievable; also try a whois of the IP address. It can be helpful.)
Case 2. No user agent of a search bot: Simply ask to fill in a CAPTCHA on the next visit.
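The answer's actual checker script is not included above; a minimal sketch of the idea it describes might look like this, with the thresholds, data structures and search-engine verification all being assumptions:

import socket
import time
from collections import defaultdict

hits = defaultdict(list)   # ip -> recent request timestamps
suspicious = {}            # ip -> user agent, filled in by the web app
needs_captcha = set()      # consulted by the web app on the next visit

def record_hit(ip, user_agent, window=10, limit=10):
    now = time.time()
    hits[ip] = [t for t in hits[ip] if now - t < window] + [now]
    if len(hits[ip]) > limit:                 # "visiting very fast"
        suspicious[ip] = user_agent

def check_suspicious():
    """Background job: verify IPs that claim to be search engines."""
    for ip, ua in list(suspicious.items()):
        claims_search_engine = any(bot in ua.lower()
                                   for bot in ("googlebot", "bingbot", "yahoo"))
        if claims_search_engine and not reverse_dns_matches(ip):
            needs_captcha.add(ip)             # Case 1: faked search-engine user agent
        elif not claims_search_engine:
            needs_captcha.add(ip)             # Case 2: not a search bot at all

def reverse_dns_matches(ip):
    """Rough stand-in for checking the IP against published bot lists / whois."""
    try:
        host = socket.gethostbyaddr(ip)[0]
    except OSError:
        return False
    return host.endswith((".googlebot.com", ".google.com", ".search.msn.com", ".yahoo.com"))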
Late answer - and also this answer probably isn't the one you want to hear...
I myself have already written many (many tens of) different specialized data-mining scrapers (just because I like the "open data" philosophy).
There is already plenty of advice in the other answers - now I will play the devil's advocate role and extend and/or correct their effectiveness.
First:
Trying to use technical barriers isn't worth the trouble they cause:
Plain HTML - the easiest case is parsing plain HTML pages with a well defined structure and CSS classes. E.g. it is enough to inspect the element with Firebug and use the right XPaths and/or CSS paths in my scraper.
You could generate the HTML structure dynamically, and you could also generate the CSS class names dynamically (and the CSS itself too, e.g. by using some random class names) - but
you can't change the structure for every response, because your regular users will hate you. Also, this will cause more trouble for you (maintenance), not for the scraper. The XPath or CSS path is determinable by the scraping script automatically from the known content.
Ajax - a little bit harder at the start, but many times it speeds up the scraping process :) - why?
When analyzing the requests and responses, I just set up my own proxy server (written in Perl) and my Firefox uses it. Of course, because it is my own proxy - it is completely hidden - the target server sees it as a regular browser. (So, no X-Forwarded-For and such headers).
Based on the proxy logs, it is mostly possible to determine the "logic" of the Ajax requests, e.g. I can skip most of the HTML scraping and just use the well-structured Ajax responses (mostly in JSON format).
So, Ajax doesn't help much...
More complicated are pages which use heavily packed JavaScript functions.
Here it is possible to use two basic methods:
Such scraping is slow (the scraping is done as in a regular browser), but it is very hard to detect and prevent.
User-Agent based filtering doesn't help at all. Any serious data miner will set it to a correct one in their scraper.
Requiring login - doesn't help. The simplest way to beat it (without any analysis and/or scripting of the login protocol) is just logging into the site as a regular user, using Mozilla, and then running a Mozrepl based scraper...
Remember, requiring login helps against anonymous bots, but doesn't help against someone who wants to scrape your data. They just register themselves on your site as a regular user.
Using frames isn't very effective either. This is used by many live movie services and it is not very hard to beat. The frames are simply one more HTML/JavaScript page that needs to be analyzed... If the data is worth the trouble - the data miner will do the required analysis.
IP-based limiting isn't effective at all - there are too many public proxy servers, and there is also TOR... :) It doesn't slow down the scraping (for someone who really wants your data).
It is very hard to scrape data hidden in images (e.g. simply converting the data into images server-side). Employing "tesseract" (OCR) helps many times - but honestly - the data must be worth the trouble for the scraper (which often it isn't).
On the other side, your users will hate you for this. Myself (even when not scraping), I hate websites which don't allow copying the page content to the clipboard (because the information is in images, or (the silly ones) try to bind some custom JavaScript event to the right click). :)
The hardest are the sites which use Java applets or Flash, and where the applet itself uses secure https requests internally. But think twice - how happy will your iPhone users be... ;). Therefore, currently very few sites use them. Myself, I block all Flash content in my browser (in regular browsing sessions) - and never use sites which depend on Flash.
Your mileage may vary..., so you can try this method - just remember - you will probably lose some of your users. Also remember, some SWF files are decompilable. ;)
Captchas (the good ones - like reCaptcha) help a lot - but your users will hate you... - just imagine how your users will love you when they need to solve some captchas on every page showing information about the music artists.
Probably no need to continue - you already get the picture.
Now what you should do:
Remember: It is nearly impossible to hide your data if, on the other side, you want to publish it (in a friendly way) to your regular users.
So,
think twice before you try to use technical barriers.
Rather than trying to block the data miners, just put more effort into your website's usability. Your users will love you. The time (& energy) invested in technical barriers usually isn't worth it - better to spend the time making an even better website...
Also, data thieves aren't like normal thieves.
If you buy an inexpensive home alarm and add a warning "this house is connected to the police" - many thieves will not even try to break in. Because one wrong move - and they go to jail...
So, you invest only a few bucks, but the thief invests and risks a lot.
But the data thief has no such risks. Just the opposite - if you make one wrong move (e.g. if you introduce some bug as a result of your technical barriers), you will lose your users. If the scraping bot doesn't work the first time, nothing happens - the data miner will just try another approach and/or debug the script.
In this case, you need to invest much more - and the scraper invests much less.
Just think about where you want to invest your time & energy...
Ps: english isn't my native - so forgive my broken english...
Things that might work against beginner scrapers:
Things that will help in general:
Things that will help but will make your users hate you:
I have done a lot of web scraping and summarized some techniques to stop web scrapers on my blog based on what I find annoying.
It is a tradeoff between your users and scrapers. If you limit IPs, use CAPTCHAs, require login, etc., you make life difficult for the scrapers. But this may also drive away your genuine users.
From a tech perspective:
Just model what Google does when you hit them with too many queries at once. That should put a halt to a lot of it.
From a legal perspective:
It sounds like the data you're publishing is not proprietary. Meaning you're publishing names and stats and other information that cannot be copyrighted.
If this is the case, the scrapers are not violating copyright by redistributing your information about artist name etc. However, they may be violating copyright when they load your site into memory because your site contains elements that are copyrightable (like layout etc).
I recommend reading about Facebook v. Power.com and seeing the arguments Facebook used to stop screen scraping. There are many legal ways you can go about trying to stop someone from scraping your website. They can be far reaching and imaginative. Sometimes the courts buy the arguments. Sometimes they don't.
But, assuming you're publishing public domain information that's not copyrightable, like names and basic stats... you should just let it go in the name of free speech and open data. That is what the web is all about.
Your best option is unfortunately fairly manual: Look for traffic patterns that you believe are indicative of scraping and ban their IP addresses.
Since you're talking about a public site, making the site search-engine friendly will also make the site scraping-friendly. If a search engine can crawl and scrape your site then a malicious scraper can as well. It's a fine line to walk.
Sure it's possible. For 100% success, take your site offline.
In reality you can do some things that make scraping a little more difficult. Google does browser checks to make sure you're not a robot scraping search results (although this, like most everything else, can be spoofed).
You can do things like require several seconds between the first connection to your site, and subsequent clicks. I'm not sure what the ideal time would be or exactly how to do it, but that's another idea.
I'm sure there are several other people who have a lot more experience, but I hope those ideas are at least somewhat helpful.
It is not the answer you probably want, but why hide what you're trying to make public?
There are a few things you can do to try and prevent screen scraping. Some are not very effective, while others (a CAPTCHA) are, but hinder usability. You have to keep in mind too that it may hinder legitimate site scrapers, such as search engine indexes.
However, I assume that if you don't want it scraped that means you don't want search engines to index it either.
Here are some things you can try:
If I had to do this, I'd probably use a combination of the last three, because they minimise the inconvenience to legitimate users. However, you'd have to accept that you won't be able to block everyone this way and once someone figures out how to get around it, they'll be able to scrape it forever. You could then just try to block their IP addresses as you discover them I guess.
Method One (Small Sites Only):
Serve encrypted / encoded data.
I scrape the web using Python (urllib, requests, BeautifulSoup, etc...) and found many websites that serve encrypted / encoded data that cannot be decrypted in any programming language, simply because the encryption method is not known to the scraper.
I achieved this in a PHP website by encrypting and minimizing the output (WARNING: this is not a good idea for large sites); the response was always jumbled content.
Example of minimizing output in PHP (How to minify php page html output?):
Method Two:
If you can't stop them, screw them over: serve fake / useless data as a response.
Method Three:
Block common scraping user agents; you'll see this on major / large websites, as it is impossible to scrape them with "python3.4" as your User-Agent.
Method Four:
Make sure all the user headers are valid. I sometimes provide as many headers as possible to make my scraper seem like an authentic user; some of them are not even true or valid, like en-FU :).
Here is a list of some of the headers I commonly provide.
Rather than blacklisting bots, maybe you should whitelist them. If you don't want to kill your search results for the top few engines, you can whitelist their user-agent strings, which are generally well-publicized. The less ethical bots tend to forge user-agent strings of popular web browsers. The top few search engines should be driving upwards of 95% of your traffic.
Identifying the bots themselves should be fairly straightforward, using the techniques other posters have suggested.
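A sketch of verifying a whitelisted crawler rather than trusting its User-Agent string, using a reverse-then-forward DNS check; the domain suffixes listed here are assumptions based on what the major engines commonly publish:

import socket

WHITELISTED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")

def is_whitelisted_crawler(ip):
    """Reverse-resolve the IP, check the domain, then forward-resolve to confirm."""
    try:
        host = socket.gethostbyaddr(ip)[0]
        if not host.endswith(WHITELISTED_SUFFIXES):
            return False
        return ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        return False

# Requests claiming "Googlebot" in the User-Agent but failing this check can be
# treated like any other unidentified client.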
Quick approach to this would be to set a booby/bot trap.
Make a page that if it's opened a certain amount of times or even opened at all, will collect certain information like the IP and whatnot (you can also consider irregularities or patterns but this page shouldn't have to be opened at all).
Make a link to this in your page that is hidden with CSS display:none; or left:-9999px; position:absolute; try to place it in places that are less likely to be ignored, like where your content falls and not in your footer, as sometimes bots can choose to forget about certain parts of a page.
In your robots.txt file set a whole bunch of disallow rules to pages you don't want friendly bots (LOL, like they have happy faces!) to gather information on and set this page as one of them.
Now, if a friendly bot comes through, it should ignore that page. Right, but that still isn't good enough. Make a couple more of these pages or somehow re-route a page to accept different names, and then place more disallow rules for these trap pages in your robots.txt file alongside the pages you want ignored.
Collect the IPs of these bots or of anyone that enters these pages. Don't ban them, but make a function to display noodled text in your content, like random numbers, copyright notices, specific text strings, or scary pictures - basically anything to hinder your good content. You can also set links that point to a page which will take forever to load, i.e. in PHP you can use the sleep() function. This will fight the crawler back if it has some sort of detection to bypass pages that take way too long to load, as some well written bots are set to process X amount of links at a time.
If you have made specific text strings/sentences why not go to your favorite search engine and search for them, it might show you where your content is ending up.
Anyway, if you think tactically and creatively this could be a good starting point. The best thing to do would be to learn how a bot works.
I'd also think about scrambling some IDs or the way attributes on the page elements are displayed:
that changes its form every time as some bots might be set to be looking for specific patterns in your pages or targeted elements.
You can't stop normal screen scraping. For better or worse, it's the nature of the web.
You can make it so no one can access certain things (including music files) unless they're logged in as a registered user. It's not too difficult to do in Apache. I assume it wouldn't be too difficult to do in IIS as well.
Most have been already said, but have you considered the CloudFlare protection? I mean this:
Other companies probably do this too; CloudFlare is the only one I know of.
I'm pretty sure that would complicate their work. I also once got my IP banned automatically for 4 months when I tried to scrape data from a site protected by CloudFlare due to rate limiting (I used a simple AJAX request loop).
One way would be to serve the content as XML attributes, URL encoded strings, preformatted text with HTML encoded JSON, or data URIs, then transform it to HTML on the client. Here are a few sites which do this:
Skechers: XML
Chrome Web Store: JSON
Bing News: data URL
Protopage: URL Encoded Strings
TiddlyWiki : HTML Entities + preformatted JSON
Amazon: Lazy Loading
XMLCalabash: Namespaced XML + Custom MIME type + Custom File extension
If you view source on any of the above, you see that scraping will simply return metadata and navigation.
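A sketch of the general pattern (not how any of the sites listed above actually implement it): the server embeds the real content as base64-encoded JSON, and a small client-side script decodes it into the DOM, so a plain HTML parser only sees an opaque blob. The route and field names are illustrative:

import base64
import json

from flask import Flask

app = Flask(__name__)

@app.route("/story")
def story():
    payload = {"title": "Story title", "body": "Story text..."}
    blob = base64.b64encode(json.dumps(payload).encode("utf-8")).decode("ascii")
    # A tiny script on the page would do JSON.parse(atob(blob)) and build the DOM.
    return f'<div id="app" data-payload="{blob}"></div>'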
I agree with most of the posts above, and I'd like to add that the more search engine friendly your site is, the more scrape-able it would be. You could try to do a couple of things that are very out there and make it harder for scrapers, but it might also affect your search-ability... It depends on how well you want your site to rank on search engines, of course.
Putting your content behind a captcha would mean that robots would find it difficult to access your content. However, humans would be inconvenienced so that may be undesirable.
If you want to see a great example, check out http://www.bkstr.com/. They use a j/s algorithm to set a cookie, then reload the page so it can use the cookie to validate that the request is being run within a browser. A desktop app built to scrape could definitely get by this, but it would stop most cURL type scraping.