HtmlAgilityPack giving problems with malformed HTML
I want to extract meaningful text from an HTML document, and I am using html-agility-pack to do it. Here is my code:
string convertedContent = HttpUtility.HtmlDecode(
    ConvertHtml(HtmlAgilityPack.HtmlEntity.DeEntitize(htmlAsString))
);
ConvertHtml:
public string ConvertHtml(string html)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
    StringWriter sw = new StringWriter();
    ConvertTo(doc.DocumentNode, sw);
    sw.Flush();
    return sw.ToString();
}
ConvertTo:
public void ConvertTo(HtmlAgilityPack.HtmlNode node, TextWriter outText)
{
    string html;
    switch (node.NodeType)
    {
        case HtmlAgilityPack.HtmlNodeType.Comment:
            // don't output comments
            break;
        case HtmlAgilityPack.HtmlNodeType.Document:
            foreach (HtmlNode subnode in node.ChildNodes)
            {
                ConvertTo(subnode, outText);
            }
            break;
        case HtmlAgilityPack.HtmlNodeType.Text:
            // script and style must not be output
            string parentName = node.ParentNode.Name;
            if ((parentName == "script") || (parentName == "style"))
                break;
            // get text
            html = ((HtmlTextNode)node).Text;
            // is it in fact a special closing node output as text?
            if (HtmlNode.IsOverlappedClosingElement(html))
                break;
            // check the text is meaningful and not a bunch of whitespace
            if (html.Trim().Length > 0)
            {
                outText.Write(HtmlEntity.DeEntitize(html) + " ");
            }
            break;
        case HtmlAgilityPack.HtmlNodeType.Element:
            switch (node.Name)
            {
                case "p":
                    // treat paragraphs as crlf
                    outText.Write("\r\n");
                    break;
            }
            if (node.HasChildNodes)
            {
                foreach (HtmlNode subnode in node.ChildNodes)
                {
                    ConvertTo(subnode, outText);
                }
            }
            break;
    }
}
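Now, in some cases the HTML pages are malformed. For example, the page http://rareseeds.com/cart/products/Purple_of_Romagna_Artichoke-646-72.html has a malformed meta tag: <meta content="text /html; charset=uft-8" http-equiv="Content-Type"> (note "uft" instead of "utf"). On such pages my code fails when I try to load the HTML document.
Can someone suggest how I can get past these malformed HTML pages and still extract the relevant text from the document?
Thanks, Kapil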
Comments (1)
As the HtmlAgilityPack project page says, "The parser is very tolerant with 'real world' malformed HTML". But the kind of error you describe may be too serious to be corrected automatically. You can set the default encoding with:
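The snippet that followed this sentence is not preserved in this copy. What follows is only a hedged sketch of what setting a default encoding might look like, not necessarily what the answerer posted. It uses the HtmlDocument properties OptionDefaultStreamEncoding and OptionReadEncoding; the names LenientHtml and LoadLenient are hypothetical. It assumes the failure comes from the parser trying to resolve the invalid charset name "uft-8" from the meta tag, so whether it helps may depend on the HtmlAgilityPack version in use.

```csharp
using System.Text;
using HtmlAgilityPack;

public static class LenientHtml
{
    // Hypothetical helper, not part of HtmlAgilityPack.
    public static HtmlDocument LoadLenient(string html)
    {
        HtmlDocument doc = new HtmlDocument();

        // Encoding to fall back on when the document is loaded from a
        // stream or file and no usable encoding is detected.
        doc.OptionDefaultStreamEncoding = Encoding.UTF8;

        // Assumption: skipping the <meta> charset lookup keeps the parser
        // from trying to resolve the invalid encoding name "uft-8".
        doc.OptionReadEncoding = false;

        doc.LoadHtml(html);
        return doc;
    }
}
```

With this in place, ConvertHtml could call LenientHtml.LoadLenient(html) instead of constructing the HtmlDocument itself, and pass the returned document's DocumentNode to ConvertTo as before.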