Looking for a good HTML parser that will provide offsetHeight-like values

Posted on 2024-08-22 17:21:28

I have a project that requires me to load an HTML document as a string and parse it. I am trying to determine which HTML node will exceed the height of a page (8.5 x 11 inches) so I can insert a ‘page-break-after’ before it. This will be done with a .NET DLL I am producing.

I have tried using the mshtml DOM. It’s not easy to load a string value into it, and when I did manage to do so, the offsetHeight (etc.) properties always returned zero. The only way I have found to make this work is to save the HTML to disk, load it via SHDocVw.InternetExplorer, and then pass that to the mshtml DOM.

I’m assuming that unless the HTML is ‘rendered’ by SHDocVw, I have no offsetHeight information for mshtml to report, as this is based on screen pixels. I could be wrong.

My current code is as follows:

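' Requires COM references to "Microsoft Internet Controls" (SHDocVw) and "Microsoft HTML Object Library" (mshtml).
Dim myIE As New SHDocVw.InternetExplorer
myIE.Navigate("D:\Temp\Test.HTML")
' Navigate is asynchronous, so wait until the page has finished loading before touching the DOM.
Do While myIE.Busy OrElse myIE.ReadyState <> SHDocVw.tagREADYSTATE.READYSTATE_COMPLETE
    System.Threading.Thread.Sleep(50)
Loop
Dim myDoc As mshtml.HTMLDocument = CType(myIE.Document, mshtml.HTMLDocument)

Dim divTag As mshtml.IHTMLElement = myDoc.getElementById("someID")

For Each childNode As mshtml.IHTMLElement In TryCast(divTag.children, mshtml.IHTMLElementCollection)
    If childNode.offsetTop + childNode.offsetHeight > 750 Then ' 750 px threshold; assumes 72 pixels = 1 inch
        childNode.insertAdjacentHTML("beforeBegin", "<DIV style='page-break-after:always'></DIV>")
    End If
Next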

I have two goals. #1 is key; #2 is ideal.

1) Load the HTML from a string, and have the above code still work.

2) Ideally, find a .NET component that will do the same thing. I don’t like relying on COM components in .NET unless I have no choice.
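
For goal #1, one commonly used approach (a sketch only, not verified against this project) is to navigate the same SHDocVw instance to about:blank and then write the string into the resulting document through mshtml's IHTMLDocument2.write, so nothing has to be saved to disk. The module name LoadHtmlFromString and the helpers WaitUntilLoaded and LoadFromString are hypothetical names introduced for illustration. Whether offsetHeight is populated afterwards still depends on IE having laid out the page, which is an assumption here.

```vbnet
Imports System.Threading

Module LoadHtmlFromString
    ' Hypothetical helper: blocks until the IE instance reports that
    ' the current document has finished loading.
    Private Sub WaitUntilLoaded(ie As SHDocVw.InternetExplorer)
        Do While ie.Busy OrElse ie.ReadyState <> SHDocVw.tagREADYSTATE.READYSTATE_COMPLETE
            Thread.Sleep(50)
        Loop
    End Sub

    ' Hypothetical helper: builds a document from an in-memory string
    ' instead of a file on disk.
    Function LoadFromString(ie As SHDocVw.InternetExplorer, html As String) As mshtml.HTMLDocument
        ' Start from an empty page so there is a document to write into.
        ie.Navigate("about:blank")
        WaitUntilLoaded(ie)

        Dim doc2 As mshtml.IHTMLDocument2 = CType(ie.Document, mshtml.IHTMLDocument2)
        doc2.write(html)   ' write accepts a ParamArray of Object
        doc2.close()       ' marks the end of input so parsing can finish
        WaitUntilLoaded(ie)

        Return CType(ie.Document, mshtml.HTMLDocument)
    End Function
End Module
```

With this in place, the existing loop could run unchanged against the document returned by LoadFromString(myIE, htmlString), where htmlString stands for the HTML held in memory. This still relies on the COM components, so it addresses goal #1 only.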
