How do I prevent certain data on a web page from being scraped?
I want to protect only certain numbers that are displayed after each request. There are about 30 such numbers. I was planning to generate images in place of those numbers, but if the images are not warped as with a CAPTCHA, won't scripts be able to decipher the numbers anyway? Also, how much of a performance hit would loading images be compared with text?
Comments (15)
The only way to make sure the bad guys don't get your data is not to share it with anyone. Any other solution essentially means entering an arms race with the screen scrapers. At one point or another, one of you will find the arms race too costly to continue. If the data you are sharing has any perceptible value, the screen scrapers will probably be very determined.
It's not possible.
You're in an arms race. What you need to do is make your information so useful and your pages so easy to use that you become the authoritative source. It's also handy to change your output formats regularly to stay ahead, but screen scrapers can handle that unless you make fairly radical changes. Radical changes drive users away because the page is continually unfamiliar to them.
Your image solution won't help much, and images are far less efficient. A number is usually only a few bytes long in HTML. Images start at a few hundred bytes and grow to 1 KB or more depending on how large you want them. Images also will not render in the font the user has selected for their browser window, and they are useless to people who rely on assistive technology (visually impaired people, for example).
Apart from images, you could display the numbers using JavaScript or Flash.
You could also use CSS to position individual digits using various combinations of absolute and relative positioning.
You could also use JavaScript to create these DIVs.
The point is just to obfuscate enough that it becomes really hard.
One more solution is to use images of segments or single dots and reconstruct the images of the digits using CSS, a bit like a dot-matrix display.
You could litter the source of the page with these absolutely positioned DIVs, and make reconstruction even more difficult by creating them dynamically.
At any rate, you can't stop a determined scraper from getting to the data: it doesn't take a lot to automate a web browser and take screenshots that can be fed to an OCR engine.
There is also nothing to stop anyone from paying someone pennies to get the data manually.
The point is: how determined are your opponents (users?).
It's a bit like the software protection business: making things hard enough to deter casual 'pirates' is not too hard, and it's a fairly good approach in general.
However, if there is much value in the data you present, there is nothing you can really do to protect it.
All you can do is make it hard enough that casual 'thieves' will prefer to continue paying for your service rather than circumvent it.
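As a rough illustration of the CSS positioning idea in this answer, here is a minimal Python sketch that emits the digits in shuffled source order and lets absolute positioning restore the visual order. The function names, the fixed 10px digit width, and the inline styles are assumptions made for the example, not anything from the original post. The second function shows how little work a scraper needs to undo it, which is the answer's own caveat.

```python
import random
import re
from typing import Optional


def obfuscate_number(number: str, digit_width_px: int = 10,
                     rng: Optional[random.Random] = None) -> str:
    """Return HTML that displays `number` with its digits shuffled in the source.

    Assumes a monospace font so that each digit fits in `digit_width_px`.
    """
    rng = rng or random.Random()
    # Pair each character with its visual position, then shuffle the source order.
    placed = list(enumerate(number))
    rng.shuffle(placed)
    spans = "".join(
        f'<span style="position:absolute;left:{pos * digit_width_px}px">{ch}</span>'
        for pos, ch in placed
    )
    width = len(number) * digit_width_px
    return (
        f'<div style="position:relative;display:inline-block;'
        f'width:{width}px;height:1.2em">{spans}</div>'
    )


def recover_number(html: str) -> str:
    """What a scraper would do: sort the spans by their left offset."""
    pairs = re.findall(r'left:(\d+)px">(.)</span>', html)
    return "".join(ch for _, ch in sorted(pairs, key=lambda p: int(p[0])))
```

Calling `recover_number(obfuscate_number("48151623"))` gives back the original string, so this only deters scrapers that read the markup naively.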
JavaScript would probably be the easiest to implement, but you could get really creative: render large blocks of numbers in which only certain ones are visible, by placing layers on top of the invalid numbers, blending the wrong numbers into the background, or making them invisible via CSS with semi-randomly generated class names.
I can't believe I'm promoting a common malware scripting tactic, but...
You could emit the numbers as encoded JavaScript that gets decoded and rendered at runtime.
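A minimal sketch of that tactic, assuming a server-side Python helper that XORs each character code with a small key and emits a script to undo it in the browser. The helper names and the XOR scheme are illustrative choices, not something the answer specifies. The decoder underneath shows why other answers in this thread call this approach trivial to reverse.

```python
import re


def encode_as_script(number: str, key: int = 7) -> str:
    """Emit a <script> that writes `number` at runtime from XOR-ed char codes."""
    codes = ",".join(str(ord(ch) ^ key) for ch in number)
    return (
        "<script>document.write(String.fromCharCode.apply(null,"
        f"[{codes}].map(function(c){{return c^{key};}})));</script>"
    )


def decode_script(script: str) -> str:
    """What a scraper would do: pull out the code list and the key, then XOR back."""
    codes = re.search(r"\[([\d,]+)\]", script).group(1).split(",")
    key = int(re.search(r"c\^(\d+)", script).group(1))
    return "".join(chr(int(c) ^ key) for c in codes)
```

The plain number never appears in the page source, but a scraper that either runs the script or applies a decoder like `decode_script` recovers it immediately.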
Generate an image containing those numbers and display the image. :-)
I think you guys are being too reactive with these solutions. JavaScript, CAPTCHA, even litigation and the DMCA process don't address the complex, adaptive nature of web scraping and data theft. Don't you think the "ideal" solution for preventing malicious bots and website scraping would be a real-time, proactive mitigation strategy? Something very similar to a Content Protection Network. Just sayin'.
Examples:
IBM - IBM ISS Data Security Services
DISTIL - www.distil.it
Can you provide a little more detail on what it is you're doing? Certainly there's a performance hit to creating an image instead of dumping out the text of a number, but how often would you be doing this per day?
Using JavaScript is the same as using text. It's trivial to reverse engineer.
Display animated numbers using Flash. It may not be foolproof, but it would make them harder to crack.
What about posting a lot of dummy numbers and showing only the right ones with external CSS? That works just as long as the scraper doesn't start parsing the external CSS.
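Here is a small sketch of the dummy-number idea, assuming the page is generated in Python and the stylesheet is served as a separate file. The function names and the random class-name scheme are made up for the example. The second function covers the caveat above: once a scraper reads the stylesheet, the decoys fall away.

```python
import random
import re
import secrets
from typing import List, Optional, Tuple


def build_page(real: str, decoys: List[str],
               rng: Optional[random.Random] = None) -> Tuple[str, str]:
    """Return (html, css): every value is in the HTML, the CSS hides the decoys."""
    rng = rng or random.Random()
    entries = [(real, True)] + [(d, False) for d in decoys]
    rng.shuffle(entries)
    html_parts, css_rules = [], []
    for value, is_real in entries:
        # A fresh random class name per element, so names carry no meaning.
        cls = "c" + secrets.token_hex(4)
        html_parts.append(f'<span class="{cls}">{value}</span>')
        if not is_real:
            css_rules.append(f".{cls}{{display:none}}")
    return "".join(html_parts), "\n".join(css_rules)


def scrape_with_css(html: str, css: str) -> List[str]:
    """A scraper that also parses the stylesheet and drops hidden elements."""
    hidden = set(re.findall(r"\.(\w+)\{display:none\}", css))
    spans = re.findall(r'class="(\w+)">([^<]*)<', html)
    return [value for cls, value in spans if cls not in hidden]
```

A scraper that only strips tags from the HTML sees all of the values and cannot tell which one is real, while `scrape_with_css` returns just the visible one.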
Don't output the numbers, i.e. prefix with //.
For all those who recommend using JavaScript or CSS to obfuscate the numbers: there's probably a way around it. Firefox has a plugin called Abduction. Basically, it saves the page to a file as an image. You could probably modify this plugin to save the image and then analyze it to find the secret number that the page is trying to hide.
Basically, if there's enough incentive behind scraping these numbers from the page, it will be done. Otherwise, just post a regular number and make things easier on your users, so they don't have to worry about being unable to copy and paste the number, or about other problems that result from this trickery.
Just do something unexpected and weird (different every time) with the CSS box model. Force them to actually use a browser-backed screen scraper.
I don't think this is possible. You can make their job harder (by using images, as some suggested here), but that is all you can do. You can't stop a determined person from getting the data. If you don't want them to scrape your data, don't publish it. It's as simple as that...
Assuming these numbers are updated often (if they aren't, protecting them is completely moot, since a human can just transcribe them by hand), you can limit automated scraping via throttling. An automated script would have to hit your site often to check for updates; if you can limit these checks, you win without resorting to obfuscation.
For pointers on throttling see this question.
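One common way to throttle is a per-client sliding window. The sketch below is an illustration under assumptions: the class name, the limits, and the use of a client identifier such as an IP address or API key are not from the answer, and a real deployment would more likely enforce this at the web server or proxy.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most `max_requests` per client within any `window_seconds` span."""

    def __init__(self, max_requests: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, client_id: str) -> bool:
        now = self.clock()
        hits = self._hits[client_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: reject or delay this request
        hits.append(now)
        return True
```

A request handler would call `limiter.allow(client_id)` before serving the numbers and return an error (HTTP 429 is the usual choice) when it comes back `False`. Scrapers that spread requests across many addresses can still get around a per-client limit.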