View Generated Source (After AJAX/JavaScript) in C#
Is there a way to view the generated source of a web page (the code after all AJAX calls and JavaScript DOM manipulations have taken place) from a C# application without opening up a browser from the code?
Viewing the initial page using a WebRequest or WebClient object works fine, but if the page makes extensive use of JavaScript to alter the DOM on page load, these return only the HTML the server sent and so don't give an accurate picture of the page.
I have tried the Selenium and WatiN UI testing frameworks and they work perfectly, supplying the generated source as it appears after all JavaScript manipulations are completed. Unfortunately, they do this by opening up an actual web browser, which is very slow. I've set up a Selenium server which offloads this work to another machine, but there is still a substantial delay.
Is there a .NET library that will load and parse a page (like a browser) and spit out the generated code? Clearly, Google and Yahoo aren't opening up browsers for every page they want to spider (of course they may have more resources than me...).
Is there such a library or am I out of luck unless I'm willing to dissect the source code of an open source browser?
SOLUTION
Well, thank you everyone for your help. I have a working solution that is about 10X faster than Selenium. Woo!
Thanks to this old article from beansoftware I was able to use the System.Windows.Forms.WebBrowser control to download the page and parse it, then give me the generated source. Even though the control is in Windows.Forms, you can still run it from ASP.NET (which is what I'm doing), just remember to add System.Windows.Forms to your project references.
There are two notable things about the code. First, the WebBrowser control is called in a new thread. This is because it must run on a single-threaded apartment (STA) thread.
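The single-threaded apartment requirement can be isolated in a small helper. The following is a minimal sketch, not part of the original answer: `StaRunner` and `Run` are hypothetical names, and it assumes .NET 4.5 or later on Windows. It also rethrows any exception from the worker on the calling thread, since an unhandled exception on a manually created thread would otherwise bring down the process.

```csharp
using System;
using System.Runtime.ExceptionServices;
using System.Threading;

// Hypothetical helper (not from the original answer): runs a delegate on a
// dedicated STA thread and hands its result back to the caller.
public static class StaRunner
{
    public static T Run<T>(Func<T> work)
    {
        T result = default(T);
        ExceptionDispatchInfo failure = null;

        var thread = new Thread(() =>
        {
            try
            {
                result = work();
            }
            catch (Exception ex)
            {
                // Capture the exception so it can be rethrown on the caller's
                // thread with its original stack trace.
                failure = ExceptionDispatchInfo.Capture(ex);
            }
        });

        // The apartment state must be set before the thread is started.
        thread.SetApartmentState(ApartmentState.STA);
        thread.IsBackground = true;
        thread.Start();
        thread.Join();

        if (failure != null)
        {
            failure.Throw();
        }
        return result;
    }
}
```

With a helper like this, the browser work could be wrapped as `StaRunner.Run(() => ...)` instead of storing the result in a field shared between threads.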
Second, the GeneratedSource variable is set in two places. This is not due to an intelligent design decision :) I'm still working on it and will update this answer when I'm done. wb_DocumentCompleted() is called multiple times: first when the initial HTML is downloaded, then again when the first round of JavaScript completes. Unfortunately, the site I'm scraping has 3 different loading stages: 1) load the initial HTML, 2) do a first round of JavaScript DOM manipulation, 3) pause for half a second, then do a second round of JS DOM manipulation.
For some reason, the second round isn't caught by the wb_DocumentCompleted() function, but it is always caught when wb.ReadyState == Complete. So why not remove it from wb_DocumentCompleted()? I'm still not sure why it isn't caught there, and that's where the beansoftware article recommended putting it. I'm going to keep looking into it. I just wanted to publish this code so anyone who's interested can use it. Enjoy!
    using System.Threading;
    using System.Windows.Forms;

    public class WebProcessor
    {
        private string GeneratedSource { get; set; }
        private string URL { get; set; }

        public string GetGeneratedHTML(string url)
        {
            URL = url;
            // The WebBrowser control needs a single-threaded apartment (STA) thread.
            Thread t = new Thread(new ThreadStart(WebBrowserThread));
            t.SetApartmentState(ApartmentState.STA);
            t.Start();
            t.Join();
            return GeneratedSource;
        }

        private void WebBrowserThread()
        {
            WebBrowser wb = new WebBrowser();
            wb.Navigate(URL);
            wb.DocumentCompleted +=
                new WebBrowserDocumentCompletedEventHandler(
                    wb_DocumentCompleted);

            // Pump window messages until the control reports the page as complete.
            while (wb.ReadyState != WebBrowserReadyState.Complete)
                Application.DoEvents();

            // Added this line, because the final HTML takes a while to show up
            GeneratedSource = wb.Document.Body.InnerHtml;
            wb.Dispose();
        }

        private void wb_DocumentCompleted(object sender,
            WebBrowserDocumentCompletedEventArgs e)
        {
            WebBrowser wb = (WebBrowser)sender;
            GeneratedSource = wb.Document.Body.InnerHtml;
        }
    }
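Because the page described above keeps changing after the first completion event, one possible refinement is to wait until the DOM stops changing for a quiet period rather than relying on a single ready-state check. The sketch below is an illustration under stated assumptions, not the author's method: `DomSettleWaiter`, `WaitUntilStable`, and the default timings are hypothetical, and the browser is abstracted behind two delegates so the waiting logic stays independent of Windows Forms.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical helper (not from the original answer): polls a snapshot of the
// page until it stays unchanged for a quiet period or a timeout elapses.
public static class DomSettleWaiter
{
    // readSnapshot: returns the current HTML, or null if the page is not ready.
    // pump:         processes pending window messages so the browser can work.
    // Returns the last non-null snapshot seen (null if none was ever available).
    public static string WaitUntilStable(Func<string> readSnapshot, Action pump,
        int quietMs = 1000, int timeoutMs = 15000, int pollMs = 50)
    {
        Stopwatch total = Stopwatch.StartNew();
        Stopwatch quiet = Stopwatch.StartNew();
        string last = null;

        while (total.ElapsedMilliseconds < timeoutMs)
        {
            pump();
            string current = readSnapshot();

            if (current == null || current != last)
            {
                // Not ready yet, or the DOM changed: remember the newest
                // snapshot and restart the quiet-period clock.
                last = current ?? last;
                quiet.Restart();
            }
            else if (quiet.ElapsedMilliseconds >= quietMs)
            {
                // Unchanged for the whole quiet period: treat as settled.
                break;
            }

            Thread.Sleep(pollMs);
        }

        return last;
    }
}
```

Inside WebBrowserThread(), the wait loop could then be replaced by a call that passes `Application.DoEvents` as `pump` and a lambda returning `wb.Document.Body.InnerHtml` (or null while `wb.Document` or its `Body` is still null) as `readSnapshot`. A quiet period longer than the half-second pause described above would be needed to catch the second round of DOM changes; the right value is site-specific and is an assumption here.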
Comments (3)
It is possibly using an instance of a browser (in your case, the IE control). You can easily host that control in your app and open a page with it. The control will then load the page and process any JavaScript. Once this is done, you can access the control's DOM object and get the "interpreted" code.
The best way is to use PhantomJS. It's great. (A sample is in the article.)
My solution looks like this:
Theoretically yes, but at present, no.
I don't think there is currently a product or OSS project that does this. Such a product would need to have its own JavaScript interpreter and be able to accurately emulate the run-time environment and quirks of every browser it supports.
Given that you need something that accurately emulates the server + browser environment in order to produce the final page code, I think that, in the long run, using a browser instance is the best way to accurately generate the page in its final state. This is especially true when you consider that, after the page load completes, the page source can still change over time in the browser through AJAX/JavaScript.