Parsing HTML sections in C#
I need to parse sections from a string of HTML. For example:
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
<p>[section=quote]</p>
<p>Mauris at turpis nec dolor bibendum sollicitudin ac quis neque.</p>
<p>[/section]</p>
Parsing the quote section should return:
<p>Mauris at turpis nec dolor bibendum sollicitudin ac quis neque.</p>
Currently I'm using a regular expression to grab the content inside [section=quote]...[/section], but since the sections are entered using a WYSIWYG editor, the section tags themselves get wrapped in a paragraph tag, so the parsed result is:
</p>
<p>Mauris at turpis nec dolor bibendum sollicitudin ac quis neque.</p>
<p>
The Regular Expression I'm using currently is:
\[section=(.+?)\](.+?)\[/section\]
And I'm also doing some additional cleanup prior to parsing the sections:
protected string CleanHtml(string input) {
    // remove whitespace
    input = Regex.Replace(input, @"\s*(<[^>]+>)\s*", "$1", RegexOptions.Singleline);
    // remove empty p elements
    input = Regex.Replace(input, @"<p\s*/>|<p>\s*</p>", string.Empty);
    return input;
}
Can anyone provide a regular expression that achieves this, or am I wasting my time trying to do it with Regex? I've seen references to the Html Agility Pack. Would that be a better fit for something like this?
[Update]
Thanks to Oscar I have used a combination of the HTML Agility pack and Regex to parse the sections. It still needs a bit of refining but it's nearly there.
public void ParseSections(string content)
{
    this.SourceContent = content;
    this.NonSectionedContent = content;
    content = CleanHtml(content);
    if (!sectionRegex.IsMatch(content))
        return;
    var doc = new HtmlDocument();
    doc.LoadHtml(content);
    bool flag = false;
    string sectionName = string.Empty;
    var sectionContent = new StringBuilder();
    var unsectioned = new StringBuilder();
    foreach (var n in doc.DocumentNode.SelectNodes("//p")) {
        if (startSectionRegex.IsMatch(n.InnerText)) {
            flag = true;
            sectionName = startSectionRegex.Match(n.InnerText).Groups[1].Value.ToLowerInvariant();
            continue;
        }
        if (endSectionRegex.IsMatch(n.InnerText)) {
            flag = false;
            this.Sections.Add(sectionName, sectionContent.ToString());
            sectionContent.Clear();
            continue;
        }
        if (flag)
            sectionContent.Append(n.OuterHtml);
        else
            unsectioned.Append(n.OuterHtml);
    }
    this.NonSectionedContent = unsectioned.ToString();
}
Comments (2)
The following works, using the HtmlAgilityPack library:
...
If you just want to print Mauris at turpis nec dolor bibendum sollicitudin ac quis neque. without the <p>...</p>, you can replace n.OuterHtml with n.InnerHtml.
Of course, you should check whether doc.DocumentNode.SelectNodes("//p") is null.
If you want to load the HTML from an online source instead of a file, you can do:
...
Edit:
If [section=quote] and [/section] could be inside any tag (not always <p>), you can replace doc.DocumentNode.SelectNodes("//p") with doc.DocumentNode.SelectNodes("//*").
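The code from the HtmlAgilityPack answer was not preserved, so here is a minimal sketch of what such a loop might look like. It assumes the HtmlAgilityPack NuGet package is referenced and that each marker sits alone in its own paragraph. The names SectionReader, ReadSection and keepParagraphTags are hypothetical and do not come from the answer.

```csharp
using System;
using System.Text;
using HtmlAgilityPack;

public static class SectionReader
{
    // Hypothetical helper: returns the markup found between
    // <p>[section=NAME]</p> and <p>[/section]</p>.
    public static string ReadSection(string html, string name, bool keepParagraphTags = true)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection paragraphs = doc.DocumentNode.SelectNodes("//p");
        if (paragraphs == null)
            return string.Empty;

        string open = "[section=" + name + "]";
        var result = new StringBuilder();
        bool inside = false;

        foreach (HtmlNode p in paragraphs)
        {
            string text = p.InnerText.Trim();
            if (text.Equals(open, StringComparison.OrdinalIgnoreCase)) { inside = true; continue; }
            if (text == "[/section]") { inside = false; continue; }
            if (inside)
                result.Append(keepParagraphTags ? p.OuterHtml : p.InnerHtml);
        }
        return result.ToString();
    }
}
```

Passing keepParagraphTags: false corresponds to the n.InnerHtml variant mentioned in the answer. To work from a URL rather than a string, HtmlAgilityPack's HtmlWeb class can fetch the page, for example `HtmlDocument doc = new HtmlWeb().Load(url);`.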
How about replacing ... with ... and ... with ... as part of your cleanup? Then you can use your existing regular expression.
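The snippets this suggestion referred to did not survive. One plausible reading is that the cleanup step should unwrap the paragraph-wrapped markers, turning `<p>[section=quote]</p>` into `[section=quote]` and `<p>[/section]</p>` into `[/section]`. The sketch below assumes exactly that and uses only System.Text.RegularExpressions. The names SectionCleanup, UnwrapMarkers and ExtractSections are hypothetical, and it assumes the wrapping `<p>` tags carry no attributes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SectionCleanup
{
    // Hypothetical helper: strips a <p> wrapper that contains nothing but a section marker.
    public static string UnwrapMarkers(string html)
    {
        return Regex.Replace(
            html,
            @"<p>\s*(\[section=[^\]]+\]|\[/section\])\s*</p>",
            "$1",
            RegexOptions.IgnoreCase);
    }

    // Applies the regular expression from the question to the unwrapped markup.
    // Singleline is added here so that section content may span several lines.
    public static Dictionary<string, string> ExtractSections(string html)
    {
        var sections = new Dictionary<string, string>();
        var pattern = new Regex(@"\[section=(.+?)\](.+?)\[/section\]", RegexOptions.Singleline);
        foreach (Match m in pattern.Matches(UnwrapMarkers(html)))
        {
            // Indexer assignment: a repeated section name overwrites instead of throwing.
            sections[m.Groups[1].Value.ToLowerInvariant()] = m.Groups[2].Value.Trim();
        }
        return sections;
    }
}
```

With the markers unwrapped first, the captured content no longer starts with a stray `</p>` or ends with a stray `<p>`, which was the symptom described in the question.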