Viewing the generated source (after AJAX/JavaScript) in C#

Posted on 2024-08-03 07:13:58

Is there a way to view the generated source of a web page (the code after all AJAX calls and JavaScript DOM manipulations have taken place) from a C# application without opening up a browser from the code?

Viewing the initial page using a WebRequest or WebClient object works ok, but if the page makes extensive use of JavaScript to alter the DOM on page load, then these don't provide an accurate picture of the page.

I have tried using the Selenium and WatiN UI testing frameworks and they work perfectly, supplying the generated source as it appears after all JavaScript manipulations are completed. Unfortunately, they do this by opening up an actual web browser, which is very slow. I've implemented a Selenium server which offloads this work to another machine, but there is still a substantial delay.

Is there a .NET library that will load and parse a page (like a browser) and spit out the generated code? Clearly, Google and Yahoo aren't opening up browsers for every page they want to spider (of course they may have more resources than me...).

Is there such a library or am I out of luck unless I'm willing to dissect the source code of an open source browser?

SOLUTION

Well, thank you everyone for your help. I have a working solution that is about 10X faster than Selenium. Woo!

Thanks to this old article from beansoftware I was able to use the System.Windows.Forms.WebBrowser control to download the page and parse it, then give me the generated source. Even though the control is in Windows.Forms, you can still run it from ASP.NET (which is what I'm doing); just remember to add System.Windows.Forms to your project references.

There are two notable things about the code. First, the WebBrowser control is called in a new thread. This is because it must run on a single-threaded apartment (STA) thread.

Second, the GeneratedSource variable is set in two places. This is not due to an intelligent design decision :) I'm still working on it and will update this answer when I'm done. wb_DocumentCompleted() is called multiple times. First when the initial HTML is downloaded, then again when the first round of JavaScript completes. Unfortunately, the site I'm scraping has 3 different loading stages. 1) Load initial HTML 2) Do first round of JavaScript DOM manipulation 3) pause for half a second then do a second round of JS DOM manipulation.

For some reason, the second round isn't caught by the wb_DocumentCompleted() function, but it is always caught when wb.ReadyState == Complete. So why not remove the assignment from wb_DocumentCompleted()? I'm still not sure why it isn't caught there, and that's where the beansoftware article recommended putting it. I'm going to keep looking into it. I just wanted to publish this code so anyone who's interested can use it. Enjoy!

using System.Threading;
using System.Windows.Forms;

public class WebProcessor
{
    private string GeneratedSource { get; set; }
    private string URL { get; set; }

    public string GetGeneratedHTML(string url)
    {
        URL = url;

        // The WebBrowser control must run on a single-threaded apartment thread.
        Thread t = new Thread(new ThreadStart(WebBrowserThread));
        t.SetApartmentState(ApartmentState.STA);
        t.Start();
        t.Join(); // block until the page has been loaded and captured

        return GeneratedSource;
    }

    private void WebBrowserThread()
    {
        WebBrowser wb = new WebBrowser();

        // Subscribe before navigating so the handler is in place for every event.
        wb.DocumentCompleted +=
            new WebBrowserDocumentCompletedEventHandler(
                wb_DocumentCompleted);
        wb.Navigate(URL);

        // Pump the message loop until the control reports the page as complete.
        while (wb.ReadyState != WebBrowserReadyState.Complete)
            Application.DoEvents();

        //Added this line, because the final HTML takes a while to show up
        GeneratedSource = wb.Document.Body.InnerHtml;

        wb.Dispose();
    }

    // Fires more than once per page (see the loading stages described above).
    private void wb_DocumentCompleted(object sender,
        WebBrowserDocumentCompletedEventArgs e)
    {
        WebBrowser wb = (WebBrowser)sender;
        GeneratedSource = wb.Document.Body.InnerHtml;
    }
}
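
The late second round of DOM changes described above could also be handled by polling until the markup stops changing, instead of relying on a single ReadyState check. What follows is only a minimal sketch of that idea, not the method from the beansoftware article or part of the answer's code: DomSettleWaiter, WaitUntilStable, and all of its parameters are hypothetical names introduced for illustration. It depends on nothing from Windows Forms; the caller supplies a function that reads the current HTML and an action that pumps messages.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical helper (not part of the original answer).
public static class DomSettleWaiter
{
    // Polls 'snapshot' until its value has stayed unchanged for 'quietPeriod',
    // or until 'timeout' elapses, whichever comes first. 'pump' is invoked on
    // every iteration; with a WebBrowser control it would be
    // Application.DoEvents, so that the page's scripts keep running.
    // Returns the last snapshot that was read.
    public static string WaitUntilStable(
        Func<string> snapshot,
        Action pump,
        TimeSpan quietPeriod,
        TimeSpan timeout)
    {
        Stopwatch total = Stopwatch.StartNew();
        Stopwatch unchanged = Stopwatch.StartNew();
        string last = snapshot();

        while (total.Elapsed < timeout && unchanged.Elapsed < quietPeriod)
        {
            pump();
            Thread.Sleep(10); // avoid a busy loop between polls

            string current = snapshot();
            if (!string.Equals(current, last, StringComparison.Ordinal))
            {
                last = current;
                unchanged.Restart(); // the DOM changed, so start waiting again
            }
        }

        return last;
    }
}
```

Inside WebBrowserThread(), after the ReadyState loop, this might be called as DomSettleWaiter.WaitUntilStable(() => wb.Document.Body.InnerHtml, Application.DoEvents, TimeSpan.FromSeconds(1), TimeSpan.FromSeconds(15)). The quiet period is assumed to need to exceed the half-second pause between the two JavaScript rounds, and the timeout keeps the thread from spinning forever on a page that never settles.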

蓝咒 2024-08-10 07:13:59

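It probably uses an instance of a browser (in your case, the IE control). You can easily use it in your application and open a page. The control will then load the page and process any JavaScript. Once that is done, you can access the control's DOM object and get the "interpreted" code.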


悲欢浪云 2024-08-10 07:13:59

The best way is to use PhantomJS. It works great. (A sample can be found in the article.)

My solution looks like this:

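var page = require('webpage').create();

page.open("https://sample.com", function () {
    // page.evaluate runs inside the page's own context, so jsonData must be
    // a global defined by the page; variables from this script are not visible.
    page.evaluate(function () {
        var i = 0,
            oJson = jsonData,
            sKey;
        localStorage.clear();

        for (; sKey = Object.keys(oJson)[i]; i++) {
            localStorage.setItem(sKey, oJson[sKey]);
        }
    });

    // Open the page again, now with localStorage populated.
    page.open("https://sample.com", function () {
        // Give the page's scripts 10 seconds to finish before capturing it.
        setTimeout(function () {
            page.render("screenshoot.png"); // where you want to save the screenshot
            console.log(page.content); // page source
            // You can access its content using jQuery (if the page loads it).
            // Return a string: evaluate can only hand back serializable values.
            var fbcomments = page.evaluate(function () {
                return $("body").contents().find(".content").html();
            });
            phantom.exit();
        }, 10000);
    });
});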
