HTML Agility Pack: strip tags that are not in a whitelist
I'm trying to create a function that removes HTML tags and attributes that are not in a whitelist.
I have the following HTML:
<b>first text </b>
<b>second text here
<a>some text here</a>
<a>some text here</a>
</b>
<a>some twxt here</a>
I am using HTML Agility Pack, and the code I have so far is:
static List<string> WhiteNodeList = new List<string> { "b" };
static List<string> WhiteAttrList = new List<string> { };
static HtmlNode htmlNode;

public static void RemoveNotInWhiteList(out string _output, HtmlNode pNode, List<string> pWhiteList, List<string> attrWhiteList)
{
    // remove all attributes not on white list
    foreach (var item in pNode.ChildNodes)
    {
        item.Attributes.Where(u => attrWhiteList.Contains(u.Name) == false).ToList().ForEach(u => RemoveAttribute(u));
    }

    // remove all html and their innerText and attributes if not on whitelist.
    //pNode.ChildNodes.Where(u => pWhiteList.Contains(u.Name) == false).ToList().ForEach(u => u.Remove());
    //pNode.ChildNodes.Where(u => pWhiteList.Contains(u.Name) == false).ToList().ForEach(u => u.ParentNode.ReplaceChild(ConvertHtmlToNode(u.InnerHtml),u));
    //pNode.ChildNodes.Where(u => pWhiteList.Contains(u.Name) == false).ToList().ForEach(u => u.Remove());
    for (int i = 0; i < pNode.ChildNodes.Count; i++)
    {
        if (!pWhiteList.Contains(pNode.ChildNodes[i].Name))
        {
            HtmlNode _newNode = ConvertHtmlToNode(pNode.ChildNodes[i].InnerHtml);
            pNode.ChildNodes[i].ParentNode.ReplaceChild(_newNode, pNode.ChildNodes[i]);
            if (pNode.ChildNodes[i].HasChildNodes && !string.IsNullOrEmpty(pNode.ChildNodes[i].InnerText.Trim().Replace("\r\n", "")))
            {
                HtmlNode outputNode1 = pNode.ChildNodes[i];
                for (int j = 0; j < pNode.ChildNodes[i].ChildNodes.Count; j++)
                {
                    string _childNodeOutput;
                    RemoveNotInWhiteList(out _childNodeOutput,
                        pNode.ChildNodes[i], WhiteNodeList, WhiteAttrList);
                    pNode.ChildNodes[i].ReplaceChild(ConvertHtmlToNode(_childNodeOutput), pNode.ChildNodes[i].ChildNodes[j]);
                    i++;
                }
            }
        }
    }

    // Console.WriteLine(pNode.OuterHtml);
    _output = pNode.OuterHtml;
}

private static void RemoveAttribute(HtmlAttribute u)
{
    u.Value = u.Value.ToLower().Replace("javascript", "");
    u.Remove();
}

public static HtmlNode ConvertHtmlToNode(string html)
{
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);
    if (doc.DocumentNode.ChildNodes.Count == 1)
        return doc.DocumentNode.ChildNodes[0];
    else return doc.DocumentNode;
}
The output I am trying to achieve is:
<b>first text </b>
<b>second text here
some text here
some text here
</b>
some twxt here
That means that I only want to keep the <b> tags.
The reason I'm doing this is that some of the users copy-paste from MS Word into my WYSIWYG HTML editor.
Thanks!
Comments (2)
Heh, apparently I almost found an answer in a blog post someone made.
I got HtmlSanitizer from here
Apparently it does not strip the tags, but removes the element altogether.
OK, here is the solution for those who will need it later.
I renamed the node because if I had to parse an XML namespace node it would crash on the xpath parsing.
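The code posted with this answer is not shown on this page, so the following is only a rough sketch of the "unwrap instead of delete" idea, not the answerer's actual solution. It assumes HtmlAgilityPack's `RemoveChild(node, keepGrandChildren)` overload moves the removed element's children up into its parent. The `WhitelistSanitizer` class and `Sanitize` method names are made up for illustration. It walks the tree directly and does not use XPath.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper; the class and method names are not from the original answer.
public static class WhitelistSanitizer
{
    public static string Sanitize(string html, ISet<string> allowedTags, ISet<string> allowedAttributes)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Snapshot the element nodes first, because the tree is modified during the loop.
        var elements = doc.DocumentNode.Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Element)
            .ToList();

        foreach (var element in elements)
        {
            if (!allowedTags.Contains(element.Name))
            {
                // Unwrap: remove the tag itself but keep its children in place
                // (assumes keepGrandChildren = true re-parents the children).
                element.ParentNode.RemoveChild(element, true);
                continue;
            }

            // The tag is allowed: drop every attribute that is not on the attribute whitelist.
            foreach (var attr in element.Attributes.ToList())
            {
                if (!allowedAttributes.Contains(attr.Name))
                    attr.Remove();
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

With the question's input, a call such as `WhitelistSanitizer.Sanitize(html, new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "b" }, new HashSet<string>(StringComparer.OrdinalIgnoreCase))` is meant to keep the <b> elements and leave the text of the <a> elements where it was. Test it against your HtmlAgilityPack version, especially with deeply nested disallowed tags.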
Thanks for the code, it's great!
I made a few optimizations...