Remove all elements with a specified class from HTML using Agility Pack
I'm trying to select all elements that have a given class and remove them from an HTML string.
This is what I have so far. It doesn't seem to remove anything, although the source clearly shows 4 elements with that class name.
// Filter page HTML to display required content
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
// pageHTML is a string containing the HTML
htmlDoc.LoadHtml(pageHTML);
// ParseErrors contains any errors from the LoadHtml call
if (!htmlDoc.ParseErrors.Any())
{
    // Remove all elements marked with pdf-ignore class
    HtmlNodeCollection nodes = htmlDoc.DocumentNode.SelectNodes("//body[@class='pdf-ignore']");
    // Remove the collection from above
    foreach (var node in nodes)
    {
        node.Remove();
    }
}
EDIT: Just to clarify, the document is parsing and the SelectNodes line is being hit; it just isn't returning anything.
Here is a snippet of the HTML:
<input type=\"submit\" name=\"ctl00$MainContent$PrintBtn\" value=\"Print Shotlist\" onclick=\"window.print();\" id=\"MainContent_PrintBtn\" class=\"pdf-ignore\">
Comments (2)
EDIT: In your updated question you posted part of the HTML string, an <input> element declaration, but you're trying to match a <body> element with the class pdf-ignore (according to your expression //body[@class='pdf-ignore']). If you want to match all the elements in the document with this class, select on any element name instead, for example //*[@class='pdf-ignore'], to get your nodes. This will match all the elements with the class name specified.
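As a rough sketch of matching every element that carries the class rather than only <body>, something like the following might work. It assumes the HtmlAgilityPack package is referenced; HtmlClassFilter and RemoveByClass are hypothetical names introduced only for this illustration, and className is assumed to be a simple token with no quotes or spaces. The XPath tests for the class as a whole token, so elements with several classes are matched too.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class HtmlClassFilter
{
    // Hypothetical helper (not part of HtmlAgilityPack): removes every element
    // whose class attribute contains className as a whole token.
    // Assumes className is a plain token without quotes or whitespace.
    public static string RemoveByClass(string html, string className)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // "//*" selects elements of any name; the concat/normalize-space
        // idiom matches the token even when other classes are present.
        string xpath = "//*[contains(concat(' ', normalize-space(@class), ' '), ' "
                       + className + " ')]";

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes != null)
        {
            // Copy to a list so removal does not disturb the enumeration.
            foreach (HtmlNode node in nodes.ToList())
            {
                node.Remove();
            }
        }

        return doc.DocumentNode.OuterHtml;
    }

    public static void Main()
    {
        // Illustrative input only, not the asker's real page.
        string pageHTML = "<div><p>keep</p><span class=\"a pdf-ignore\">drop</span></div>";
        Console.WriteLine(RemoveByClass(pageHTML, "pdf-ignore"));
    }
}
```

The null check matters here: with an expression that matches nothing, iterating the result of SelectNodes directly would throw a NullReferenceException instead of silently doing nothing.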
Your code seems to be correct except for one detail: the condition htmlDoc.ParseErrors == null. You select and remove nodes ONLY if the ParseErrors property (which is of type IEnumerable<HtmlParseError>) is null, but if no errors are found this property actually returns an empty list. So changing the condition to !htmlDoc.ParseErrors.Any() should solve the issue.
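To illustrate the difference between testing ParseErrors for null and testing it for emptiness, here is a small sketch. It assumes the HtmlAgilityPack package is referenced; ParseErrorCheck and CountParseErrors are hypothetical names used only for this example.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class ParseErrorCheck
{
    // Hypothetical helper (not part of HtmlAgilityPack): loads the HTML and
    // reports how many parse errors the parser recorded.
    public static int CountParseErrors(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // ParseErrors is an IEnumerable<HtmlParseError>; it is enumerated here
        // rather than compared against null.
        foreach (HtmlParseError error in doc.ParseErrors)
        {
            Console.WriteLine("Line " + error.Line + ": " + error.Reason);
        }

        return doc.ParseErrors.Count();
    }

    public static void Main()
    {
        // Illustrative input only.
        int count = CountParseErrors("<html><body><p>ok</p></body></html>");

        // Equivalent in spirit to the !htmlDoc.ParseErrors.Any() condition.
        Console.WriteLine(count == 0 ? "no parse errors recorded" : count + " parse error(s) recorded");
    }
}
```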
Your XPath is probably not matching: have you tried "//div[class='pdf-ignore']" (no "@")?