C# Internet Explorer and stripping HTML tags

Posted 2025-01-07 05:24:14

Is there a way to open an Internet Explorer process from C#, send HTML content to that browser, and capture the 'displayed' content?

I am aware of other HTML-stripping methods (e.g. HtmlAgilityPack), but I would like to explore the approach described above.

Thanks,
LG

Comments (2)

半夏半凉 2025-01-14 05:24:14

You can use the WebBrowser control, which exists for both WinForms and WPF, to host IE in your application. Set the control's Source to your HTML, then wait for the content to load. Use the LayoutUpdated event for this rather than the Loaded event: Loaded is raised when the HTML has finished downloading, which is not necessarily after the page has been laid out and all dynamic JS has run. Once loading is done, access the Document property to get the HTML.
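
As a rough illustration of the hosting approach, here is a minimal sketch that uses the WinForms flavor of the control. It relies on the WinForms members `DocumentText` (to supply the HTML string) and `DocumentCompleted` (to learn that loading finished), which are named differently from the WPF members mentioned above. The class name `RenderedTextExtractor` and the method `GetDisplayedText` are made up for this example. It assumes a Windows machine with a project that references System.Windows.Forms, and it assumes the page has no scripts that keep changing the content after loading completes.

```csharp
using System;
using System.Threading;
using System.Windows.Forms;

// Hypothetical helper, not part of any library.
static class RenderedTextExtractor
{
    // Loads the HTML into an off-screen WebBrowser (the hosted IE engine)
    // and returns the text of the document body as IE exposes it.
    public static string GetDisplayedText(string html)
    {
        string result = null;

        // The WebBrowser control is an ActiveX wrapper: it needs an STA
        // thread and a running message loop to raise its events.
        var thread = new Thread(() =>
        {
            using (var browser = new WebBrowser { ScriptErrorsSuppressed = true })
            {
                browser.DocumentCompleted += (sender, e) =>
                {
                    // InnerText is the body content with the markup removed.
                    result = browser.Document?.Body?.InnerText ?? string.Empty;
                    Application.ExitThread(); // stop the message loop below
                };

                browser.DocumentText = html; // starts loading the string
                Application.Run();           // pump messages until ExitThread
            }
        });

        thread.SetApartmentState(ApartmentState.STA);
        thread.Start();
        thread.Join();
        return result;
    }
}
```

A call such as `RenderedTextExtractor.GetDisplayedText("<p>Hello <b>world</b></p>")` would then hand back the body text without the tags. Running the control on its own STA thread is a design choice that lets the helper be called from a console application or a background thread; inside an existing WinForms application you could instead place the control on a form and handle `DocumentCompleted` there.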

玩套路吗 2025-01-14 05:24:14
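    // Namespaces used by the code below (place these at the top of the file).
    using System;
    using System.Collections.Generic;
    using System.Net;
    using System.Text.RegularExpressions;

    // Downloads the page at the given URL and returns every link found in it.
    // Returns an empty list if the download fails.
    public List<LinkItem> getListOfLinksFromPage(string webpage)
    {
        List<LinkItem> list = new List<LinkItem>();
        try
        {
            using (WebClient w = new WebClient())
            {
                string s = w.DownloadString(webpage);
                list.AddRange(LinkFinder.Find(s));
            }
        }
        catch (Exception)
        {
            // Best effort: ignore the failure and return what was collected.
        }
        return list;
    }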

    public struct LinkItem
    {
        public string Href;
        public string Text;

        public override string ToString()
        {
            return Href;
        }
    }

    static class LinkFinder
    {
        // Extracts every <a>...</a> element from the HTML in `file`.
        public static List<LinkItem> Find(string file)
        {
            List<LinkItem> list = new List<LinkItem>();

            // 1. Find all anchor elements. The \b stops "<a" from also
            //    matching other tags such as <abbr> or <article>.
            MatchCollection m1 = Regex.Matches(file, @"(<a\b.*?>.*?</a>)", RegexOptions.Singleline);

            // 2. Loop over each match.
            foreach (Match m in m1)
            {
                string value = m.Groups[1].Value;
                LinkItem i = new LinkItem();

                // 3. Get the href attribute (double-quoted values only).
                Match m2 = Regex.Match(value, @"href=""(.*?)""", RegexOptions.Singleline);
                if (m2.Success)
                {
                    i.Href = m2.Groups[1].Value;
                }

                // 4. Remove the tags, leaving only the link text.
                i.Text = Regex.Replace(value, @"\s*<.*?>\s*", "", RegexOptions.Singleline);

                list.Add(i);
            }

            return list;
        }
    }

Someone else wrote the regular expressions, so I can't take credit for them, but the code above opens a WebClient for the page that is passed in and uses regular expressions to find all of the links on that page. I'm not sure this is what you are looking for, but if you simply want to "grab" all of the HTML content and save it to a file, you can write the string "s", created in the line "string s = w.DownloadString(webpage);", to a file.
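
For the save-to-file variant, a minimal sketch could look like the following. The class name `PageSaver` and the method `SavePage` are invented for this example; `WebClient.DownloadString` and `File.WriteAllText` are the standard .NET calls. Note that this stores the raw markup exactly as downloaded, so no tags are stripped and no scripts are run.

```csharp
using System;
using System.IO;
using System.Net;

// Hypothetical helper, not part of any library.
static class PageSaver
{
    // Downloads the raw HTML at `url` and writes it to `path` unchanged.
    public static void SavePage(string url, string path)
    {
        using (var client = new WebClient())
        {
            string html = client.DownloadString(url);
            File.WriteAllText(path, html);
        }
    }
}
```

Usage would be along the lines of `PageSaver.SavePage(someUrl, "page.html")`, where both arguments are up to the caller. On newer .NET versions `WebClient` is marked obsolete in favor of `HttpClient`, although it still works; it is kept here to stay close to the answer's code.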
