Bots - Detecting JavaScript Events

Posted on 2024-12-07 11:35:55

Can bots figure out the JavaScript sections of a webpage? Would it be possible for them to parse the source code of a webpage (I am guessing the dynamic scripts will show up in the source code) and determine the JavaScript events?

Also, I am curious whether bots can do this in any way apart from merely parsing the source code. For example, say there is a script that populates a text field with a random string whenever a user clicks a button. By merely parsing the page source, a bot cannot determine what the string will be (since there is just a rand() call). So can a bot in any way guess the actual contents of the string that gets entered into the text field?
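To make the scenario concrete, here is a minimal sketch of the kind of script I have in mind (the element IDs are made up):

```javascript
// Hypothetical page script: clicking the button fills the field with a
// value that cannot be predicted by reading the source alone.
document.getElementById('gen-button').addEventListener('click', function () {
  var token = Math.random().toString(36).slice(2, 10); // random 8-char string
  document.getElementById('token-field').value = token;
});
```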

P.S.: I am a grad student researching web bot detection techniques.

Comments (2)

国粹 2024-12-14 11:35:55

First of all, bots are very unlikely to even execute any JavaScript on your page. With so many zillions of web pages on the internet, it's generally easier to just go scan more web pages than to linger trying to solve issues on pages that only expose content via JavaScript.

Second, a generic bot is not going to know how your web page works and is not going to know what needs to be where before doing some operation on your page. Bots scrape and parse what they find, looking for things of interest. If a URL appears in your script as a full URL, they might be smart enough to find it. But if a URL is built from pieces in your script, it's extremely unlikely that any generic bot would figure out that your code is assembling a URL, or what the result would be.
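To illustrate the difference, a made-up example (the URLs mean nothing):

```javascript
// A full URL in the source is easy for a scraper's regex to harvest:
var feedUrl = 'https://example.com/api/feed.json';

// A URL assembled at runtime is effectively invisible to source parsing;
// only something that actually executes the script sees the final value:
var base = 'https://example.com/api/';
var section = ['fe', 'ed'].join('');          // becomes "feed"
var assembledUrl = base + section + '.json';  // resolved only at runtime
```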

Third, a specific attacker could analyze your page, figure out how it works, and design a way to circumvent certain user operations. But that only happens if some attacker decides to specifically attack/circumvent your site. No generic bot that hasn't been specifically coded for your site is going to be able to do that. That's where captcha-type operations come into the picture, because it's very hard for a script to "read" images and extract the codes that then have to be submitted to the server - so even bots built for a specific purpose can't really solve captcha-type problems. Attackers can use real people to solve captchas, but now it's starting to cost them money, and few sites are worth that. The idea behind many of these obstacles is simply to make the cost of circumventing them higher than the benefit of getting in. The snoops are in it to make money, so they run the other way when it costs more than they can make.

Fourth, you asked about "events". Keep in mind that there are programmatic events (timers, page load events, etc...) and there are user events. Programmatic events will occur only if there is a browser to cause them, or if the JavaScript code is executed in a browser-like environment. User events (keys, clicks, mouse movements, etc...) will only occur if there is a browser-like environment in which to interact with the web page and an actual user to create those events. None of these are typically present when a bot reads your page: bots use a server-side script to fetch the page and then parse it. They could programmatically drive a browser to load your page and create some of the programmatic events, but there still wouldn't be a user present to create any user events. If a bot knew what user events it was supposed to simulate (a button click, for example), it could drive a controlled browser to do that, but at that point it is not a generic bot in any fashion; it is a bot designed specifically to attack your site. And if someone wants to go to that trouble, it's probably far easier to reverse-engineer how your site requests the desired content (so the bot can just request it directly) than to try to simulate an actual browser page that does so.
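For a sense of what such a site-specific bot looks like, here is a minimal sketch using Puppeteer, a Node.js library that drives a real browser engine (the URL and selectors are placeholders, and this is just one of several tools that could do this):

```javascript
// Sketch of a site-specific bot driving a controlled browser.
// Assumes Node.js with the puppeteer package installed.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com/form');  // placeholder URL
  await page.click('#gen-button');              // simulate the user event
  // Read the value that the page's own script produced:
  const token = await page.$eval('#token-field', el => el.value);
  console.log('value generated by the page script:', token);
  await browser.close();
})();
```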


If you want to try to detect (from your web server) whether something accessing your web page is likely a bot or a human, all you can do is study its access patterns from one page to the next. Bots will "crawl" your site in some programmatic fashion.

  • Bots will typically have fairly regular access patterns (a certain amount of time between each page access). Real users are likely to have very different access patterns. (A toy version of this check is sketched just after this list.)
  • Bots will typically not interact with controls on your page (buttons, fields, etc...) and will not trigger things that only happen when those controls are used, so any URLs that your page creates programmatically when those controls are used will never be accessed.
  • Bots will not know to follow written directions on the page. They will just try to access direct links that they find in the page.
  • Bots will likely follow links that are never visible - humans usually won't. So, if you see a quick access pattern from your main page to a link that is in your page but always invisible (CSS style rule display: none), then it's unlikely a human did that. It's probably some programmatic agent (e.g. a bot). So you can set traps like this for bots, in places humans won't go (see the honeypot sketch after this list).
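A toy version of the timing check from the first point above, keeping per-client request timestamps in memory and flagging suspiciously regular intervals (the thresholds are arbitrary and for illustration only):

```javascript
// Toy regularity check: flag clients whose inter-request gaps are nearly
// constant. In-memory only; a real system would persist and expire this.
const hits = new Map(); // client id -> array of request timestamps (ms)

function recordAndCheck(clientId, now = Date.now()) {
  const times = hits.get(clientId) || [];
  times.push(now);
  hits.set(clientId, times);
  if (times.length < 5) return false; // not enough data to judge yet

  const gaps = [];
  for (let i = 1; i < times.length; i++) gaps.push(times[i] - times[i - 1]);
  const mean = gaps.reduce((a, b) => a + b, 0) / gaps.length;
  const variance = gaps.reduce((a, b) => a + (b - mean) ** 2, 0) / gaps.length;
  // Machine-like pacing: the spread of the gaps is tiny relative to the mean.
  return Math.sqrt(variance) < 0.05 * mean;
}
```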
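And a sketch of the hidden-link trap from the last point, written here as a small Express server (the route names and port are made up); no human ever sees the link, so any request for it is strong bot evidence:

```javascript
// Honeypot sketch: serve a link no human can see and treat any visit
// to it as bot evidence. Assumes Node.js with the express package.
const express = require('express');
const app = express();

app.get('/', (req, res) => {
  res.send(`
    <a href="/real-page">Products</a>
    <!-- Invisible to humans, but a naive crawler will follow it: -->
    <a href="/honeypot-link" style="display: none">secret</a>
  `);
});

app.get('/honeypot-link', (req, res) => {
  console.log('likely bot:', req.ip); // flag or block this client
  res.status(404).end();
});

app.listen(3000);
```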
活泼老夫 2024-12-14 11:35:55

No, bots cannot guess the random string generated by JavaScript.
