Parsing nodes in a context-sensitive way using Html Agility Pack
<div class="mvb"><b>Date 1</b></div>
<div class="mxb"><b>Header 1</b></div>
<div>
inner html 1
</div>
<div class="mvb"><b>Date 2</b></div>
<div class="mxb"><b>Header 2</b></div>
<div>
inner html 2
</div>
I would like to parse the inner html between the tags in such a way that I can
- associate the inner html 1 with header 1 and date 1
- associate the inner html 2 with header 2 and date 2
In other words, at the time I parse the inner html 1 I would like to know that the html nodes containing "Date 1" and "Header 1" have been parsed (but the nodes containing "Date 2" and "Header 2" have not been parsed)
If I were doing this via regular text parsing, I would read one line at a time and record the last "Date" and "Header" that I had parsed. Then, when it came time to parse the inner html 1, I could refer to the last parsed "Date" and "Header" objects to associate them together.
Comments (2)
Using the Html Agility Pack, you can leverage the power of XPath and forget about that verbose xlinq crap :-). The XPath position() function is context sensitive. Here is some sample code:
That will print this with your sample:
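As a minimal sketch of this XPath approach (an illustration, not the answerer's original listing), the following assumes the page always repeats the date / header / body triple of sibling divs shown in the question. The class name PositionDemo and the inlined sample string are hypothetical; the Html Agility Pack calls used are HtmlDocument.LoadHtml, SelectNodes, SelectSingleNode and InnerText.

```csharp
using System;
using HtmlAgilityPack;

static class PositionDemo
{
    static void Main()
    {
        // Sample markup from the question, inlined so the sketch is self-contained.
        const string html =
            "<div class=\"mvb\"><b>Date 1</b></div>\n" +
            "<div class=\"mxb\"><b>Header 1</b></div>\n" +
            "<div>\ninner html 1\n</div>\n" +
            "<div class=\"mvb\"><b>Date 2</b></div>\n" +
            "<div class=\"mxb\"><b>Header 2</b></div>\n" +
            "<div>\ninner html 2\n</div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection dateNodes = doc.DocumentNode.SelectNodes("//div[@class='mvb']");
        if (dateNodes == null)
        {
            Console.WriteLine("No date nodes found.");
            return;
        }

        foreach (HtmlNode dateNode in dateNodes)
        {
            // position() is evaluated relative to the current date node:
            // 1 is the div right after it, 2 is the div after that.
            HtmlNode headerNode = dateNode.SelectSingleNode("following-sibling::div[position()=1]");
            HtmlNode bodyNode = dateNode.SelectSingleNode("following-sibling::div[position()=2]");
            if (headerNode == null || bodyNode == null)
            {
                continue; // layout differs from the assumed date/header/body triple
            }

            Console.WriteLine("date=" + dateNode.InnerText.Trim());
            Console.WriteLine("header=" + headerNode.InnerText.Trim());
            Console.WriteLine("body=" + bodyNode.InnerText.Trim());
        }
    }
}
```

Because each relative query starts from the current date node, the body of block 1 can only ever be paired with "Date 1" and "Header 1", which is the context sensitivity the question asks for.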
Well, you can do this in several ways...
For example, if the HTML you want to parse is the one you wrote in your question, an easy way could be:
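A sketch of what collecting the three groups of nodes might look like, assuming the class names mvb and mxb from the question and assuming the body is the first element after each header. The names CollectDemo and NextElement are hypothetical helpers, not part of Html Agility Pack.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class CollectDemo
{
    // Hypothetical helper: next sibling that is an element,
    // skipping the whitespace text nodes that sit between the divs.
    static HtmlNode NextElement(HtmlNode node)
    {
        HtmlNode sibling = node.NextSibling;
        while (sibling != null && sibling.NodeType != HtmlNodeType.Element)
        {
            sibling = sibling.NextSibling;
        }
        return sibling;
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<div class=\"mvb\"><b>Date 1</b></div>\n" +
            "<div class=\"mxb\"><b>Header 1</b></div>\n" +
            "<div>\ninner html 1\n</div>\n" +
            "<div class=\"mvb\"><b>Date 2</b></div>\n" +
            "<div class=\"mxb\"><b>Header 2</b></div>\n" +
            "<div>\ninner html 2\n</div>");

        // Both calls return null when nothing matches, so guard before using them.
        HtmlNodeCollection dateNodes = doc.DocumentNode.SelectNodes("//div[@class='mvb']");
        HtmlNodeCollection headerNodes = doc.DocumentNode.SelectNodes("//div[@class='mxb']");
        if (dateNodes == null || headerNodes == null)
        {
            Console.WriteLine("Expected date and header nodes were not found.");
            return;
        }

        // LINQ: for every header, take the element that follows it.
        // Entries stay null if a header has no following element.
        List<HtmlNode> innerTextNodes = headerNodes.Select(h => NextElement(h)).ToList();

        Console.WriteLine("dates=" + dateNodes.Count);
        Console.WriteLine("headers=" + headerNodes.Count);
        Console.WriteLine("bodies=" + innerTextNodes.Count(n => n != null));
    }
}
```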
If everything is okay and the HTML has that layout, all three collections will contain the same number of elements.
Then you can easily do:
The following should work:
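One way the pairing step could be written, as a sketch under the same assumptions (class names mvb and mxb, body directly after each header). PairingDemo and NextElement are hypothetical names. The loop relies on the three collections lining up by index, so the sketch refuses to pair anything when the counts differ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class PairingDemo
{
    // Hypothetical helper: next sibling that is an element (skips text nodes).
    static HtmlNode NextElement(HtmlNode node)
    {
        HtmlNode sibling = node.NextSibling;
        while (sibling != null && sibling.NodeType != HtmlNodeType.Element)
        {
            sibling = sibling.NextSibling;
        }
        return sibling;
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<div class=\"mvb\"><b>Date 1</b></div>\n" +
            "<div class=\"mxb\"><b>Header 1</b></div>\n" +
            "<div>\ninner html 1\n</div>\n" +
            "<div class=\"mvb\"><b>Date 2</b></div>\n" +
            "<div class=\"mxb\"><b>Header 2</b></div>\n" +
            "<div>\ninner html 2\n</div>");

        HtmlNodeCollection dateNodes = doc.DocumentNode.SelectNodes("//div[@class='mvb']");
        HtmlNodeCollection headerNodes = doc.DocumentNode.SelectNodes("//div[@class='mxb']");
        if (dateNodes == null || headerNodes == null)
        {
            Console.WriteLine("Expected date and header nodes were not found.");
            return;
        }

        List<HtmlNode> innerTextNodes = headerNodes.Select(h => NextElement(h)).ToList();

        // Index-based pairing is only meaningful if every group has one entry per block.
        if (dateNodes.Count != headerNodes.Count || innerTextNodes.Contains(null))
        {
            Console.WriteLine("Layout does not match the expected date/header/body pattern.");
            return;
        }

        for (int i = 0; i < dateNodes.Count; i++)
        {
            string date = dateNodes[i].InnerText.Trim();
            string header = headerNodes[i].InnerText.Trim();
            string body = innerTextNodes[i].InnerText.Trim();
            Console.WriteLine(date + " | " + header + " | " + body);
        }
    }
}
```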
This is one way of doing it.
Of course, you should check for errors (for example, that the .SelectNodes(..) calls are not returning null), check for errors in the LINQ expression when storing innerTextNodes, and refactor the for (...) loop, maybe into a method that receives an HtmlNode and returns its InnerText property.
Keep in mind that the only way you can know, in the HTML code you posted, which <div> tag contains the inner text is to assume it is the one next to the <div> tag that contains the header. That's why I used the LINQ expression.
Another way of identifying it would be if the <div> had some particular attribute (like class="___") or similar, or if it contained some tags inside it and not just text. There is no magic when parsing HTML :)
Edit: I have not tested this code. Test it yourself and let me know if it works.