Fetching an external page and parsing meta tags in C# without Regex?

Posted 2024-10-06 04:32:26

Consider the following code:

    // Requires: using System.IO; using System.Net; using System.Web.Mvc;
    public ActionResult Index(String URLQuery = "http://www.google.com")
    {
        HttpWebRequest webRequest = (HttpWebRequest) WebRequest.Create(URLQuery);
        webRequest.Timeout = 10 * 1000; // milliseconds
        webRequest.KeepAlive = false;
        webRequest.Accept = "text/html"; // ask for HTML; ContentType would describe a request body

        String queryContent;

        // Dispose the response and the reader so the connection is released.
        using (HttpWebResponse webResponse = (HttpWebResponse) webRequest.GetResponse())
        using (StreamReader responseStream = new StreamReader(webResponse.GetResponseStream(), System.Text.Encoding.UTF8))
        {
            queryContent = responseStream.ReadToEnd();
        }

        ViewData["StreamResult"] = queryContent;
        return View();
    }

Essentially, this simply grabs a web page and spits it out as-is. What I'd like to do is take the fetched HTML and parse it, much like PHP lets you do with its built-in DOM objects. I have seen many examples that use Regex for this, but I feel that is inefficient and leads to too many weird edge cases that might leave me with corrupt data.

Is this even possible? Am I doomed to use Regex for this?

Comments (2)

乱了心跳 2024-10-13 04:32:26

You should use a parser for this; it looks like the HTML Agility Pack will do what you want.

又爬满兰若 2024-10-13 04:32:26

Using the Html Agility Pack you can do this very easily. Below is a sample using XPath. The newer version supports LINQ syntax as well, but I haven't tried that yet personally.

    StreamReader responseStream = new StreamReader(webResponse.GetResponseStream(),
                                                   System.Text.Encoding.UTF8);

    queryContent = responseStream.ReadToEnd();
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(queryContent);
    HtmlNode bodyNode = doc.DocumentNode.SelectSingleNode("//body | //BODY");
    /* do processing here */
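
Since the question is specifically about meta tags, here is a rough sketch of how the same approach might be pointed at `<meta>` elements instead of `<body>`. It assumes the HtmlAgilityPack NuGet package is referenced. The class and method names (`MetaTagSketch`, `ReadMetaTags`) and the sample HTML are made up for illustration and are not part of the library or of the code above.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class MetaTagSketch
{
    // Hypothetical helper: maps each <meta> tag's name (or property) to its content.
    public static Dictionary<string, string> ReadMetaTags(string html)
    {
        var result = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection metaNodes = doc.DocumentNode.SelectNodes("//meta");
        if (metaNodes == null)
        {
            return result;
        }

        foreach (HtmlNode meta in metaNodes)
        {
            // Prefer name="..."; fall back to property="..." (used by Open Graph tags).
            string key = meta.GetAttributeValue("name", "");
            if (key.Length == 0)
            {
                key = meta.GetAttributeValue("property", "");
            }

            if (key.Length > 0)
            {
                // A later duplicate key overwrites an earlier one.
                result[key] = meta.GetAttributeValue("content", "");
            }
        }

        return result;
    }

    public static void Main()
    {
        // Sample input for illustration only.
        string html = "<html><head>"
            + "<meta name=\"description\" content=\"Example page\">"
            + "<meta property=\"og:title\" content=\"Example\">"
            + "</head><body></body></html>";

        foreach (KeyValuePair<string, string> tag in ReadMetaTags(html))
        {
            Console.WriteLine(tag.Key + " = " + tag.Value);
        }
    }
}
```

In the controller from the question, `queryContent` would be passed in as the `html` argument. The LINQ style mentioned above would presumably start from `doc.DocumentNode.Descendants("meta")` rather than an XPath query, which also avoids the null check.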