Best way to combine nodes with Html Agility Pack
I've converted a large document from Word to HTML. It's close, but I have a bunch of "code" nodes that I'd like to merge into one "pre" node.
Here's the input:
<p>Here's a sample MVC Controller action:</p>
<code> public ActionResult Index()</code>
<code> {</code>
<code> return View();</code>
<code> }</code>
<p>We'll start by making the following changes...</p>
I want to turn it into this, instead:
<p>Here's a sample MVC Controller action:</p>
<pre class="brush: csharp"> public ActionResult Index()
{
return View();
}</pre>
<p>We'll start by making the following changes...</p>
I ended up writing a brute-force loop that iterates nodes looking for consecutive ones, but this seems ugly to me:
HtmlDocument doc = new HtmlDocument();
doc.Load(file);
var nodes = doc.DocumentNode.ChildNodes;
string contents = string.Empty;
foreach (HtmlNode node in nodes)
{
    if (node.Name == "code")
    {
        contents += node.InnerText + Environment.NewLine;
        if (node.NextSibling.Name != "code" &&
            !(node.NextSibling.Name == "#text" && node.NextSibling.NextSibling.Name == "code")
           )
        {
            node.Name = "pre";
            node.Attributes.RemoveAll();
            node.SetAttributeValue("class", "brush: csharp");
            node.InnerHtml = contents;
            contents = string.Empty;
        }
    }
}
nodes = doc.DocumentNode.SelectNodes(@"//code");
foreach (var node in nodes)
{
    node.Remove();
}
Normally I'd remove the nodes in the first loop, but that doesn't work during iteration since you can't change the collection as you iterate over it.
Better ideas?
Comments (2)
The first approach: select all the <code> nodes, group them, and create a <pre> node per group. The grouping key combines the parent node with a group index, which is incremented whenever a new group is found.

I also used a NextSiblingIsCode extension method, which determines whether the next sibling is a <code> node.

The second approach: select only the first <code> node of each group, then walk forward from each of these nodes through the following <code> nodes until the first non-<code> node. I used XPath for the selection.
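The answer's original code did not survive on this page, so what follows is only a rough sketch of how the two approaches could look, not the answerer's implementation. The class name CodeRunMerger, both method names, the body of NextSiblingIsCode, and the XPath expression are all assumptions. The sketch also assumes that consecutive <code> elements are separated by nothing more than whitespace.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class CodeRunMerger
{
    private static bool IsBlankText(HtmlNode node) =>
        node != null && node.NodeType == HtmlNodeType.Text
        && string.IsNullOrWhiteSpace(node.InnerText);

    // Stand-in for the NextSiblingIsCode helper named in the answer (body is assumed):
    // true when the next sibling, skipping whitespace-only text, is a <code> element.
    public static bool NextSiblingIsCode(this HtmlNode node)
    {
        HtmlNode next = node.NextSibling;
        while (IsBlankText(next)) next = next.NextSibling;
        return next != null && next.Name == "code";
    }

    // Approach 1: select every <code>, group by (parent, run index), one <pre> per group.
    public static void MergeByGrouping(HtmlDocument doc)
    {
        HtmlNodeCollection codes = doc.DocumentNode.SelectNodes("//code");
        if (codes == null) return; // SelectNodes returns null when nothing matches

        int runIndex = 0;
        var keyed = new List<KeyValuePair<int, HtmlNode>>();
        foreach (HtmlNode code in codes)
        {
            keyed.Add(new KeyValuePair<int, HtmlNode>(runIndex, code));
            if (!code.NextSiblingIsCode()) runIndex++; // this node closes its run
        }

        foreach (var run in keyed.GroupBy(k => new { k.Value.ParentNode, Run = k.Key }))
        {
            List<HtmlNode> nodes = run.Select(k => k.Value).ToList();
            HtmlNode pre = doc.CreateElement("pre");
            pre.SetAttributeValue("class", "brush: csharp");
            pre.InnerHtml = string.Join(Environment.NewLine, nodes.Select(n => n.InnerText));
            nodes[0].ParentNode.ReplaceChild(pre, nodes[0]);
            foreach (HtmlNode extra in nodes.Skip(1)) extra.Remove();
        }
    }

    // Approach 2: select only the first <code> of each run, then absorb its followers.
    public static void MergeFromRunStarts(HtmlDocument doc)
    {
        // A run start is a <code> whose nearest preceding element sibling is not a <code>.
        HtmlNodeCollection starts = doc.DocumentNode.SelectNodes(
            "//code[not(preceding-sibling::*[1][self::code])]");
        if (starts == null) return;

        foreach (HtmlNode start in starts)
        {
            var lines = new List<string> { start.InnerText };
            HtmlNode next = start.NextSibling;
            while (next != null && (next.Name == "code" || IsBlankText(next)))
            {
                HtmlNode following = next.NextSibling; // capture before detaching
                if (next.Name == "code") lines.Add(next.InnerText);
                next.Remove();
                next = following;
            }
            start.Name = "pre";
            start.Attributes.RemoveAll();
            start.SetAttributeValue("class", "brush: csharp");
            start.InnerHtml = string.Join(Environment.NewLine, lines);
        }
    }
}
```

Either method would be called once after doc.Load(file). Both work from a collection returned by SelectNodes, so removing nodes from the document does not disturb the loop, which is the problem the question ran into. The first method leaves the whitespace between the old <code> elements in place after the new <pre>; the second removes it, including any whitespace that follows the last <code> of a run.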
Sanitize the HTML you want to parse. See "HTML Agility Pack strip tags NOT IN whitelist".