
C# Internet Explorer and stripping HTML tags

Posted on 2025-01-07 05:24:14

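Is there a way to launch an Internet Explorer process from C#, send HTML content to that browser, and capture the "displayed" content?

I'm aware of other ways to strip HTML (for example, HtmlAgilityPack), but I'd like to explore the route described above.

Thanks, LG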


半夏半凉 2025-01-14 05:24:14

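You can host IE inside your application with the WebBrowser control, which exists in both WinForms and WPF. Set the control's Source to your HTML, wait for the content to load (use the LayoutUpdated event rather than the Loaded event, which is raised when the HTML has finished downloading, not necessarily after it has been laid out and all dynamic JS has run), and then read the Document property to get the HTML.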
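
As a rough illustration of the approach above, here is a minimal sketch for the WinForms flavor of the control. It assumes the WinForms member names, which differ from the ones mentioned in the answer: the HTML is assigned through DocumentText, completion is signaled by the DocumentCompleted event, and the rendered text is read from Document.Body.InnerText (the WPF control instead accepts a string through NavigateToString and raises LoadCompleted). The class name RenderedTextExtractor and the method GetDisplayedText are hypothetical. The sketch does not handle timeouts, and DocumentCompleted gives no guarantee that scripts scheduled to run later have finished.

```csharp
using System;
using System.Threading;
using System.Windows.Forms;

// Hypothetical helper, not part of any library.
static class RenderedTextExtractor
{
    // Renders the given HTML in an off-screen WebBrowser control and
    // returns the text the browser would display (tags stripped).
    public static string GetDisplayedText(string html)
    {
        string result = null;

        // The WebBrowser control wraps an ActiveX component, so it needs
        // a single-threaded apartment and a running message loop.
        var thread = new Thread(() =>
        {
            using (var browser = new WebBrowser { ScriptErrorsSuppressed = true })
            {
                browser.DocumentCompleted += (sender, e) =>
                {
                    // Body can be null for an empty document.
                    HtmlElement body = browser.Document?.Body;
                    result = body?.InnerText ?? string.Empty;

                    // Stop the message loop started by Application.Run below.
                    Application.ExitThread();
                };

                browser.DocumentText = html;
                Application.Run();
            }
        });

        thread.SetApartmentState(ApartmentState.STA);
        thread.Start();
        thread.Join();
        return result;
    }
}
```

Calling RenderedTextExtractor.GetDisplayedText("<p>Hello <b>world</b></p>") should return the paragraph text without markup. Running the control on a dedicated STA thread keeps the sketch usable from a console application as well as from a GUI.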


玩套路吗 2025-01-14 05:24:14

    using System.Collections.Generic;
    using System.Net;
    using System.Text.RegularExpressions;

    // The wrapper class name is arbitrary; the method can live in any class.
    public static class LinkScraper
    {
        // Downloads the page at the given URL and returns every link found in it.
        // Returns an empty list if the download fails.
        public static List<LinkItem> getListOfLinksFromPage(string webpage)
        {
            List<LinkItem> list = new List<LinkItem>();
            try
            {
                using (WebClient w = new WebClient())
                {
                    string s = w.DownloadString(webpage);
                    list.AddRange(LinkFinder.Find(s));
                }
            }
            catch (WebException)
            {
                // Download failed; fall through and return the (empty) list.
            }
            return list;
        }
    }

    public struct LinkItem
    {
        public string Href;
        public string Text;

        public override string ToString()
        {
            return Href;
        }
    }

    static class LinkFinder
    {
        public static List<LinkItem> Find(string file)
        {
            List<LinkItem> list = new List<LinkItem>();

            // 1.
            // Find all anchor elements in the file.
            // \b keeps the pattern from matching other tags such as <abbr>.
            MatchCollection m1 = Regex.Matches(file, @"(<a\b.*?>.*?</a>)", RegexOptions.Singleline);

            // 2.
            // Loop over each match.
            foreach (Match m in m1)
            {
                string value = m.Groups[1].Value;
                LinkItem i = new LinkItem();

                // 3.
                // Get href attribute (double-quoted values only).
                Match m2 = Regex.Match(value, @"href=\""(.*?)\""",
                    RegexOptions.Singleline);
                if (m2.Success)
                {
                    i.Href = m2.Groups[1].Value;
                }

                // 4.
                // Remove the anchor tag and any inner tags, leaving the link text.
                string t = Regex.Replace(value, @"\s*<.*?>\s*", "",
                    RegexOptions.Singleline);
                i.Text = t;

                list.Add(i);
            }

            return list;
        }
    }

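Someone else wrote the regular expressions, so I can't take credit for them. The code above creates a WebClient, downloads the page at the URL passed in, and uses regular expressions to find all of the links on that page. I'm not sure this is what you are looking for, but if you simply want to "grab" all of the HTML content and save it to a file, you can write the string "s" created in the line "string s = w.DownloadString(webpage);" to a file.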
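
A minimal sketch of that last suggestion, assuming a hypothetical helper named PageSaver.SavePage and a caller-supplied output path (neither appears in the code above). Note that this saves the raw markup as downloaded, not the text a browser would display.

```csharp
using System;
using System.IO;
using System.Net;

// Hypothetical helper, not part of the answer's code.
static class PageSaver
{
    // Downloads the document at the given URL and writes it, unmodified,
    // to the given file path (overwriting any existing file).
    public static void SavePage(string url, string path)
    {
        using (var client = new WebClient())
        {
            string html = client.DownloadString(url);
            File.WriteAllText(path, html);
        }
    }
}
```

For example, PageSaver.SavePage("http://example.com/", "page.html") would write the downloaded markup to page.html in the current directory. Errors such as an unreachable host surface as a WebException, which the caller is left to handle.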

