C# Web Parsing Conflict
It seems that I'm encountering quite a few problems in a simple attempt to parse some HTML. As practice, I'm writing a multi-threaded web crawler that starts with a list of sites to crawl. This gets handed down through a few classes, which should eventually return the content of the sites back to my system. This seems rather straightforward, but I've had no luck with either of the following tasks:
A. Convert the content of a website (in string format, from an HttpWebRequest Stream) to an HtmlDocument (cannot create a new instance of an HtmlDocument? Doesn't make much sense...) by using the HtmlDocument.Write() method.
or
B. Collect an HtmlDocument via a WebBrowser instance.
Here is my code as it exists; any advice would be great...
public void Start()
{
if (this.RunningThread == null)
{
Console.WriteLine( "Executing SiteCrawler for " + SiteRoot.DnsSafeHost);
this.RunningThread = new Thread(this.Start);
this.RunningThread.SetApartmentState(ApartmentState.STA);
this.RunningThread.Start();
}
else
{
try
{
WebBrowser BrowserEmulator = new WebBrowser();
BrowserEmulator.Navigate(this.SiteRoot);
HtmlElementCollection LinkCollection = BrowserEmulator.Document.GetElementsByTagName("a");
List<PageCrawler> PageCrawlerList = new List<PageCrawler>();
foreach (HtmlElement Link in LinkCollection)
{
PageCrawlerList.Add(new PageCrawler(Link.GetAttribute("href"), true));
continue;
}
return;
}
catch (Exception e)
{
throw new Exception("Exception encountered in SiteCrawler: " + e.Message);
}
}
}
This code seems to do nothing when it passes over the 'Navigate' method. I've attempted allowing it to open in a new window, which pops up a new instance of IE and proceeds to navigate to the specified address, but not before my program steps over the Navigate call. I've tried waiting for the browser to be 'not busy', but it never seems to pick up the busy attribute anyway. I've tried creating a new document via Browser.Document.OpenNew() so that I might populate it with data from a WebRequest stream; however, as I'm sure you can guess, I get back a null reference exception when I try to reach through the 'Document' portion of that statement. I've done some research and this appears to be the only way to create a new HtmlDocument.
As you can see, this method is intended to kick off a 'PageCrawler' for every link in a specified page. I am sure that I could parse through the HTML character by character to find all of the links, after using an HttpWebRequest and collecting the data from the stream, but that is far more work than should be necessary.
If anyone has any advice it would be greatly appreciated. Thank you.
If this is a console application, then it will not work, since a console application doesn't have a message pump (which the WebBrowser control requires to process messages). If you run this in a Windows Forms application, then you should handle the DocumentCompleted event, and then implement the method that handles that event.
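A minimal sketch of the event-based approach described above, assuming a Windows Forms project (PageCrawler is the poster's own class, and the form wiring here is illustrative, not the answer's exact code):

```csharp
using System;
using System.Collections.Generic;
using System.Windows.Forms;

public class SiteCrawlerForm : Form
{
    private readonly WebBrowser browserEmulator = new WebBrowser();
    private readonly List<PageCrawler> pageCrawlerList = new List<PageCrawler>();

    public void StartCrawl(Uri siteRoot)
    {
        // Subscribe BEFORE navigating: Navigate returns immediately, and the
        // Document property is only populated once DocumentCompleted fires.
        browserEmulator.DocumentCompleted += BrowserEmulator_DocumentCompleted;
        browserEmulator.Navigate(siteRoot);
    }

    private void BrowserEmulator_DocumentCompleted(
        object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        // At this point the page has finished loading, so reading
        // Document no longer throws a null reference exception.
        HtmlElementCollection linkCollection =
            browserEmulator.Document.GetElementsByTagName("a");
        foreach (HtmlElement link in linkCollection)
        {
            pageCrawlerList.Add(new PageCrawler(link.GetAttribute("href"), true));
        }
    }
}
```

The key difference from the code in the question is that the links are read inside the event handler rather than on the line immediately after Navigate.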
If you want to run this in a console application, then you need to use a different method for downloading pages. I would recommend that you use WebRequest/WebResponse and then use the HtmlAgilityPack to parse the HTML. The HtmlAgilityPack will generate an HtmlDocument for you, and you can get the links from there.

Additionally, if you're interested in learning more about building scalable web crawlers, check out the following links:
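A console-application sketch of the WebRequest + HtmlAgilityPack approach, assuming the HtmlAgilityPack package is referenced (the URL is a placeholder):

```csharp
using System;
using System.IO;
using System.Net;
using HtmlAgilityPack;

class LinkLister
{
    static void Main()
    {
        // Download the page the plain .NET way; no message pump needed.
        WebRequest request = WebRequest.Create("http://example.com/");
        using (WebResponse response = request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            string html = reader.ReadToEnd();

            // HtmlAgilityPack builds its own HtmlDocument directly from the
            // string, which is exactly what System.Windows.Forms.HtmlDocument
            // would not allow.
            HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
            document.LoadHtml(html);

            // SelectNodes returns null when nothing matches, so check first.
            HtmlNodeCollection links = document.DocumentNode.SelectNodes("//a[@href]");
            if (links != null)
            {
                foreach (HtmlNode link in links)
                {
                    Console.WriteLine(link.GetAttributeValue("href", string.Empty));
                }
            }
        }
    }
}
```

Each href printed here could be handed straight to the poster's PageCrawler constructor in place of the WebBrowser-based loop.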
Good luck!