Convert a string or HTML file to a C# HtmlDocument without using WebBrowser or HAP

Posted on 2025-01-07 13:22:51

The only solution I could find was using:

            mshtml.HTMLDocument htmldocu = new mshtml.HTMLDocument();
            htmldocu.createDocumentFromUrl(url, "");

I am not sure about its performance, but it should be faster than loading the HTML file in a WebBrowser control and then grabbing the HtmlDocument from there. In any case, that code does not work on my machine: the application crashes when it executes the second line.

Does anyone have an efficient way to achieve this, or any other approach at all?

NOTE: Please understand that I need the HtmlDocument object for DOM processing. I do not need the HTML string.

Comments (2)

欢你一世 2025-01-14 13:22:51

Use the DownloadString method of the WebClient class. For example:

WebClient client = new WebClient();
string reply = client.DownloadString("http://www.google.com");

After this code runs, reply contains the HTML markup returned by http://www.google.com.

WebClient.DownloadString MSDN

茶花眉 2025-01-14 13:22:51

To answer your actual question from four years ago (as of the time I am posting this), here is a working solution. I wouldn't be surprised if you have since found another way to do this, so this is mostly for other people searching for something similar. Keep in mind, however, that this approach is:

  1. somewhat obsolete (HtmlDocument itself is dated),
  2. not the best way to handle HTML DOM parsing (the preferred solution is HtmlAgilityPack, CsQuery, or some other method that does actual parsing rather than regular expressions),
  3. extremely hacky, and therefore not the safest or most compatible way to do it,
  4. something you really should not be doing.

Additionally, keep in mind that System.Windows.Forms.HtmlDocument is really just a managed wrapper around the MSHTML COM document (its IHTMLDocument2 interface), so it is technically slower than using the COM object directly. That said, I completely understand wanting it simply for ease of coding.
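For comparison, here is roughly what the parser-based route from item 2 looks like. This is only a sketch and assumes the HtmlAgilityPack NuGet package is referenced; the class name `HapSketch` and the method `FirstDivText` are made up for illustration. Note that `HtmlAgilityPack.HtmlDocument` is a different type from `System.Windows.Forms.HtmlDocument`, so it does not give you the WinForms object the question asks for.

```csharp
using System;

public static class HapSketch
{
    // Parses the markup in memory and returns the text of the first <div>,
    // or null when the document contains no <div> at all.
    public static string FirstDivText(string html)
    {
        // Fully qualified to avoid confusion with System.Windows.Forms.HtmlDocument.
        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode takes an XPath expression and returns null on no match.
        HtmlAgilityPack.HtmlNode div = doc.DocumentNode.SelectSingleNode("//div");
        return div == null ? null : div.InnerText;
    }

    public static void Main()
    {
        string html = "<html><body><div>Hello, world!</div></body></html>";
        Console.WriteLine(FirstDivText(html));
        // prints: Hello, world!
    }
}
```

To parse a file instead of a string, `doc.Load(path)` can replace `doc.LoadHtml(html)`.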

If you're cool with all of the above, here's how to accomplish what you want.

using System;
using System.Reflection;
using System.Windows.Forms;

public class HtmlDocumentFactory
{
  private static Type htmlDocType = typeof(System.Windows.Forms.HtmlDocument);
  private static Type htmlShimManagerType = null;
  private static object htmlShimSingleton = null;
  private static ConstructorInfo docCtor = null;

  public static HtmlDocument Create()
  {
    if (htmlShimManagerType == null)
    {
      // get a type reference to HtmlShimManager
      htmlShimManagerType = htmlDocType.Assembly.GetType(
        "System.Windows.Forms.HtmlShimManager"
        );
      // locate the necessary private constructor for HtmlShimManager
      var shimCtor = htmlShimManagerType.GetConstructor(
        BindingFlags.NonPublic | BindingFlags.Instance, null, new Type[0], null
        );
      // create a new HtmlShimManager object and keep it for the rest of the
      // assembly instance
      htmlShimSingleton = shimCtor.Invoke(null);
    }

    if (docCtor == null)
    {
      // get the only constructor for HtmlDocument (which is marked as private)
      docCtor = htmlDocType.GetConstructors(
        BindingFlags.NonPublic | BindingFlags.Instance
        )[0];
    }

    // create an instance of the MSHTML HTMLDocument COM class by its CLSID
    // (the object implements IHTMLDocument2, which HtmlDocument wraps)
    object htmlDoc2Inst = Activator.CreateInstance(Type.GetTypeFromCLSID(
      new Guid("25336920-03F9-11CF-8FD0-00AA00686F13")
      ));
    var argValues = new object[] { htmlShimSingleton, htmlDoc2Inst };
    // create a new HtmlDocument without involving WebBrowser
    return (HtmlDocument)docCtor.Invoke(argValues);
  }
}

To use it:

var htmlDoc = HtmlDocumentFactory.Create();
htmlDoc.Write("<html><body><div>Hello, world!</div></body></html>");
Console.WriteLine(htmlDoc.Body.InnerText);
// output:
// Hello, world!

I have not tested this code directly. I translated it from an old PowerShell script that needed the same functionality you're asking for. If it fails, let me know. The approach itself works, but the code might need very minor tweaking.
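The factory above relies on one core technique: using reflection to call a constructor that is not public. Here is a minimal standalone sketch of that technique, independent of Windows Forms. The `Hidden` class and the `PrivateCtorDemo` helper are hypothetical stand-ins, not part of any library. It looks the constructor up by its exact parameter list and checks for null. That is a little more defensive than taking element `[0]` of `GetConstructors`, because internal constructors can change between framework versions.

```csharp
using System;
using System.Reflection;

// Hypothetical type standing in for a class that exposes no public constructor.
public sealed class Hidden
{
    public string Name { get; private set; }

    private Hidden(string name)
    {
        Name = name;
    }
}

public static class PrivateCtorDemo
{
    public static Hidden CreateHidden(string name)
    {
        // Ask for a non-public instance constructor taking exactly one string.
        ConstructorInfo ctor = typeof(Hidden).GetConstructor(
            BindingFlags.NonPublic | BindingFlags.Instance,
            null,
            new[] { typeof(string) },
            null);

        // GetConstructor returns null when nothing matches, so fail loudly here
        // instead of hitting a NullReferenceException on Invoke.
        if (ctor == null)
            throw new MissingMethodException("Hidden", ".ctor(string)");

        return (Hidden)ctor.Invoke(new object[] { name });
    }

    public static void Main()
    {
        Console.WriteLine(CreateHidden("made via reflection").Name);
        // prints: made via reflection
    }
}
```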
