Using HTML Agility to strip HTML tags but keep the inner text?
I am trying to strip out some HTML tags. I have a project where a user has saved some searches. The problem is that the keywords have been highlighted. For example:
<p>Here is some <span class='highlite'>awesome</span> example.</p>
Html Agility turns this into 3 nodes: a text node, a span, and another text node. I would like to merge these into a single text node, so that it looks like:
<p>Here is some awesome example.</p>
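I tried selecting all tags with the CSS class highlite and then removing them while keeping their children:
// Strip all highlight span tags but keep their children.
// SelectNodes returns null when nothing matches, so guard against that.
var hiliteTags = doc.DocumentNode.SelectNodes("//span[@class='highlite']");
if (hiliteTags != null)
{
    foreach (var tag in hiliteTags)
    {
        // The second argument (keepGrandChildren) moves the span's children up to the parent.
        tag.ParentNode.RemoveChild(tag, true);
    }
}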
but that results in three adjacent text nodes, whereas I wanted a single text node. I then tried to use
Node.InnerText += someVariable;
but InnerText, despite what the documentation says, is read-only.
Any ideas on how to do this?
Secondly, while I am asking, is there a way to get rid of nodes that contain only text consisting of \r\n? I am not interested in those at all; they just get in the way and make the parsing awkward. I would like to be able to remove those too. For example:
<tr>
<td>Foo</td>
<td>Bar</td>
</tr>
Using Html Agility, this becomes:
Node (tr)
Node (\r\n)
Node (td - Foo)
Node (\r\n)
Node (td - Bar)
Node (\r\n)
Node (tr)
I am struggling to select those nodes. I have tried with LINQ and I have tried using XPath, but I just can't seem to remove them.
Comments (1)
What if you just take the InnerText of the p tag and create a separate document tree to save it?
Does this help?
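One way the InnerText suggestion above might be sketched is shown below. This is only a rough illustration, not a tested solution for the original project: the Program class, the sample markup, and the choice to rebuild each paragraph in place (rather than in a separate document) are all assumptions made for the example. It also includes a possible approach to the second part of the question, dropping text nodes that contain only whitespace. Note that the XPath matches only spans whose class attribute is exactly highlite, and that rebuilding a paragraph from InnerText discards any other inline markup inside it.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Hypothetical sample input combining both snippets from the question.
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<p>Here is some <span class='highlite'>awesome</span> example.</p>\r\n" +
            "<table><tr>\r\n<td>Foo</td>\r\n<td>Bar</td>\r\n</tr></table>");

        // 1. Collapse each paragraph that contains a highlight span into a
        //    single text node built from its InnerText.
        //    SelectNodes returns null when nothing matches.
        var paragraphs = doc.DocumentNode.SelectNodes("//p[span[@class='highlite']]");
        if (paragraphs != null)
        {
            foreach (var p in paragraphs)
            {
                string text = p.InnerText;   // read before clearing the children
                p.RemoveAllChildren();
                p.AppendChild(doc.CreateTextNode(text));
            }
        }

        // 2. Remove text nodes that hold only whitespace (such as "\r\n").
        //    ToList() takes a snapshot so the tree can be modified while looping.
        var blanks = doc.DocumentNode.Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Text
                        && string.IsNullOrWhiteSpace(n.InnerText))
            .ToList();
        foreach (var node in blanks)
        {
            node.ParentNode.RemoveChild(node);
        }

        Console.WriteLine(doc.DocumentNode.OuterHtml);
    }
}
```

Whitespace is significant inside elements such as pre, so the second step would probably need a filter on the parent node if the saved searches can contain preformatted text.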