When is a web page considered "loaded" in the presence of JS, etc.?

Posted 2024-10-01 12:57:09

Note: I have no knowledge of JavaScript. None.

Is there any way to determine when a web page is completely loaded? Let's say I have a crawler that uses WebKit to render pages (and WebKit's JS engine to run any JS functions and finish processing the DOM, etc.). How can I know when a page is "done" loading? What I consider to be done:

1) All scripts have finished executing.
2) No pending AJAX calls.
3) The DOM is completely processed and loaded based on currently available information.

For a more concrete hypothetical, from looking at the source of a few sites, I see that they load ads by using a script tag that injects stuff into the DOM, and issues AJAX calls to load and populate the ads. How can one determine when all this is done?

(Replace the example with anything asynchronous, I guess. I just couldn't think of anything more universal than the above.)

By "detect", I mean in any manner possible. For instance, injecting a bit of JS code into the page that writes something to the page to let me know stuff is done. Or, with QtWebKit, JS can call into C++ (I believe), so a JS snippet could call a C++ function to let it know when the page was "loaded". Whatever works, in short.

The current 'naive' implementation I have just sits and waits for a few seconds after loading a page. It's stupid.

Please be as detailed as possible, and feel free to say 'read this first' if more background information is required prior to me understanding the answer.

Thank you very much!

Comments (1)

街角迷惘 2024-10-08 12:57:09

It's in general impossible to say whether a page that contains asynchronous, script-driven content is truly done loading. Aside from the fundamental issue of the halting problem, it's possible for scripts or plugins to register for periodic timer events and continue modifying or adding to the page indefinitely.

The approach I've usually seen treats a page as done loading once the entire DOM has been loaded, resources (images, stylesheets, scripts, etc.) referenced directly from that DOM have been loaded, and all script code has been read and executed through once. Text emitted via document.write() is treated for this purpose as if it were directly included in the source HTML. If you're using QtWebKit, I believe this is the behavior you will see if you connect to the signal QWebPage::loadFinished(bool). (You can get the containing QWebPage from a QWebFrame using the accessor page().)
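From inside the page, the closest standard analogue to this definition is the window `load` event, which fires after the document and its directly referenced resources have loaded. The snippet below is a minimal sketch of waiting for it from injected script; `whenLoaded` is a hypothetical helper name, and whether its timing matches `QWebPage::loadFinished(bool)` exactly is an assumption, not something confirmed here.

```javascript
// Sketch only: whenLoaded is a hypothetical helper, not a browser or Qt API.
// Runs callback once the page has reached the "load" milestone.
function whenLoaded(win, callback) {
  // If the load event has already fired, readyState is "complete".
  if (win.document.readyState === "complete") {
    callback();
    return;
  }
  const onLoad = () => {
    win.removeEventListener("load", onLoad);
    callback();
  };
  win.addEventListener("load", onLoad);
}
```

In a page this would be called as `whenLoaded(window, ...)`. Passing the window in as a parameter only keeps the helper easy to exercise outside a browser.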

Deferred actions set up by the script code, whether by timers, events waiting for other resources to finish loading, or what have you, are not counted; media players and other plugins may complicate things further because each media type or even player may have a different standard of what constitutes "loaded".
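Since deferred work is not counted and may never stop, a crawler that wants to catch most of it can only fall back on a heuristic. One possible sketch, assuming you can inject script before the page's own scripts run: treat the page as settled once no XMLHttpRequest is in flight and the DOM has not changed for a quiet period, with a hard cap so that pages with endless timers still terminate. All names here (`createIdleTracker`, `installInPage`, `quietMs`, `maxMs`) are hypothetical, and the sketch assumes the embedded engine supports the XHR `loadend` event and `MutationObserver`; it does not cover `fetch()`, plugins, or media.

```javascript
// Sketch only: createIdleTracker and installInPage are hypothetical names,
// not part of WebKit, Qt, or any library.

// Core logic: call onIdle exactly once, either when nothing is pending and
// nothing has happened for quietMs ("quiet"), or when maxMs elapses ("timeout").
function createIdleTracker({ quietMs, maxMs, onIdle, timers }) {
  // Default to the real timers; a substitute object can be passed for testing.
  const t = timers || {
    setTimeout: (fn, ms) => setTimeout(fn, ms),
    clearTimeout: (handle) => clearTimeout(handle),
  };
  let pending = 0;
  let quietTimer = null;
  let capTimer = null;
  let done = false;

  function finish(reason) {
    if (done) return;
    done = true;
    if (quietTimer !== null) t.clearTimeout(quietTimer);
    if (capTimer !== null) t.clearTimeout(capTimer);
    onIdle(reason);
  }

  // Restart the quiet-period countdown; it only runs while nothing is pending.
  function reschedule() {
    if (done) return;
    if (quietTimer !== null) t.clearTimeout(quietTimer);
    quietTimer = null;
    if (pending === 0) {
      quietTimer = t.setTimeout(() => finish("quiet"), quietMs);
    }
  }

  capTimer = t.setTimeout(() => finish("timeout"), maxMs);
  reschedule();

  return {
    begin() { pending += 1; reschedule(); },                 // an async operation started
    end() { pending = Math.max(0, pending - 1); reschedule(); }, // one finished
    touch() { reschedule(); },                               // some activity was observed
  };
}

// Browser-side wiring: counts XHRs in flight and treats DOM mutations as activity.
function installInPage(tracker) {
  const origSend = XMLHttpRequest.prototype.send;
  XMLHttpRequest.prototype.send = function (...args) {
    tracker.begin();
    const onEnd = () => {
      this.removeEventListener("loadend", onEnd);
      tracker.end();
    };
    this.addEventListener("loadend", onEnd);
    try {
      return origSend.apply(this, args);
    } catch (err) {
      onEnd(); // send() failed synchronously, so no loadend will follow
      throw err;
    }
  };
  new MutationObserver(() => tracker.touch()).observe(document, {
    childList: true, subtree: true, attributes: true, characterData: true,
  });
}
```

A page-side caller might run `installInPage(createIdleTracker({ quietMs: 500, maxMs: 15000, onIdle: report }))`, where `report` is whatever hook signals the host application, and the two durations are arbitrary illustrative values. The result is still a guess: a script that polls every few seconds looks idle between polls, which is why the cap and the quiet period have to be tuned per crawl.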

A number of recent JavaScript libraries exploit this behavior to improve perceived page load times by loading an incomplete page containing just the first screen's worth of content plus some script, and not actually beginning to load images and content "below the fold" until after the first screenful or so is done loading and rendering. It's not very friendly to automated tools, crawlers or those who consider JavaScript a privilege to be earned by trusted sites, though.
