Parsing HTML with HTMLAgilityPack
I have the following HTML that I'm trying to parse using the HTML Agility Pack.
This is a snippet of the whole file that is returned by the code:
<div class="story-body fnt-13 p20-b user-gen">
    <p>text here text here text </p>
    <p>text here text here text text here text here text text here text here text text here text here text </p>
    <div class="gallery clr bdr aln-c js-no-shadow mod cld">
        <div>
            <ol>
                <li class="fader-item aln-c ">
                    <div class="imageWrap m10-b">
                        <img class="http://www.domain.com/picture.png| " src="http://www.domain.com/picture.png" alt="alt text" />
                    </div>
                    <p class="caption">caption text</p>
                </li>
            </ol>
        </div>
    </div >
    <p>text here text here text text here text here text text here text here text text here text here text </p>
    <p>text here text here text text here text here text text here text here text text here text here text text here text here text </p>
    <p>text here text here text text here text here text text here text here text text here text here text text here text here text </p>
</div>
I get this snippet using the following code (which is messy, I know):
string url = "http://www.domain.com/story.html";
var webGet = new HtmlWeb();
var document = webGet.Load(url);
var links = document.DocumentNode
    .Descendants("div")
    .Where(div => div.GetAttributeValue("class", "").Contains("story-body fnt-13 p20-b user-gen"))
    .SelectMany(div => div.Descendants("p"))
    .ToList();
int cn = links.Count;
HtmlAgilityPack.HtmlNodeCollection tl = document.DocumentNode.SelectNodes("/html[1]/body[1]/div[1]/div[2]/div[1]/div[1]/div[1]/div[2]/div[1]");
foreach (HtmlAgilityPack.HtmlNode node in tl)
{
    textBox1.AppendText(node.InnerText.Trim());
    textBox1.AppendText(System.Environment.NewLine);
}
The code loops through each p and (for now) appends it to a textbox. All is working correctly other than the div tag with the class gallery clr bdr aln-c js-no-shadow mod cld. The result of this bit of HTML is that I get the &#8203; and caption text bits.
What's the best way to omit that from the results?
Comments (2)
XPATH is your friend. Try this and forget about that crappy xlink syntax :-)
This expression will select all P nodes that don't have any attributes set. See here for other samples: XPath Syntax
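The expression this answer refers to is not shown in the text. As a minimal sketch, and assuming the intended XPath is //p[not(@*)] (an assumption inferred from the description "all P nodes that don't have any attributes set"), the selection might look like the following. The class name AttributeFreeParagraphs, the Extract helper, and the shortened sample HTML are hypothetical and only there to make the example self-contained.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class AttributeFreeParagraphs
{
    // Returns the trimmed text of every <p> element that carries no attributes.
    // The XPath "//p[not(@*)]" is an assumed reconstruction, not the answer's verbatim text.
    public static string[] Extract(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // By default SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = document.DocumentNode.SelectNodes("//p[not(@*)]");
        if (nodes == null)
        {
            return new string[0];
        }

        return nodes.Select(p => p.InnerText.Trim()).ToArray();
    }

    public static void Main()
    {
        // Shortened, hypothetical stand-in for the story markup in the question.
        const string html =
            "<div class=\"story-body\">" +
            "<p>first</p>" +
            "<div class=\"gallery\"><p class=\"caption\">caption text</p></div>" +
            "<p>second</p>" +
            "</div>";

        foreach (string text in Extract(html))
        {
            Console.WriteLine(text);
        }
    }
}
```

Because the caption paragraph has a class attribute, it is skipped by this predicate. The approach relies on the story paragraphs having no attributes at all, so it would also drop any story paragraph that happens to carry one.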
It's not quite clear what you're asking. I think you're asking how to get just the direct descendants of a particular div. If that's the case, then use ChildNodes rather than Descendants. The problem is that Descendants does a fully recursive walk of the document tree.
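The code that accompanied this answer is not shown in the text. A minimal sketch of the idea, assuming the query from the question is kept and only Descendants("p") is swapped for a filter over ChildNodes, could look like this. The class name DirectChildParagraphs, the Extract helper, and the shortened sample HTML are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class DirectChildParagraphs
{
    // Returns the trimmed text of the <p> elements that are direct children
    // of any div whose class attribute contains containerClass.
    public static List<string> Extract(string html, string containerClass)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        return document.DocumentNode
            .Descendants("div")
            .Where(div => div.GetAttributeValue("class", "").Contains(containerClass))
            // ChildNodes holds direct children only (including text nodes),
            // so the caption <p> nested inside the gallery div is never reached.
            .SelectMany(div => div.ChildNodes.Where(child => child.Name == "p"))
            .Select(p => p.InnerText.Trim())
            .ToList();
    }

    public static void Main()
    {
        // Shortened, hypothetical stand-in for the story markup in the question.
        const string html =
            "<div class=\"story-body fnt-13 p20-b user-gen\">" +
            "<p>first</p>" +
            "<div class=\"gallery\"><div><p class=\"caption\">caption text</p></div></div>" +
            "<p>second</p>" +
            "</div>";

        foreach (string text in Extract(html, "story-body fnt-13 p20-b user-gen"))
        {
            Console.WriteLine(text);
        }
    }
}
```

Filtering on child.Name is needed because ChildNodes also contains text and comment nodes. Unlike an attribute-based XPath test, this version does not care whether the story paragraphs have attributes, only where they sit in the tree.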