当前位置：文江博客话题详情

C# html-agility-pack

HtmlAgilityPack 给出了格式错误的 html 问题

发布于 2024-09-03 13:45:17 字数 2618 浏览 4 评论 0 原文

我想从 html 文档中提取有意义的文本，并且我使用 html-agility-pack 来实现相同的目的。这是我的代码：

string convertedContent = HttpUtility.HtmlDecode(
    ConvertHtml(HtmlAgilityPack.HtmlEntity.DeEntitize(htmlAsString))
);

ConvertHtml:

public string ConvertHtml(string html)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    StringWriter sw = new StringWriter();
    ConvertTo(doc.DocumentNode, sw);
    sw.Flush();
    return sw.ToString();
}

ConvertTo:

public void ConvertTo(HtmlAgilityPack.HtmlNode node, TextWriter outText)
{
    string html;
    switch (node.NodeType)
    {
        case HtmlAgilityPack.HtmlNodeType.Comment:
            // don't output comments
            break;

        case HtmlAgilityPack.HtmlNodeType.Document:
            foreach (HtmlNode subnode in node.ChildNodes)
            {
              ConvertTo(subnode, outText);
            }
            break;

        case HtmlAgilityPack.HtmlNodeType.Text:
            // script and style must not be output
            string parentName = node.ParentNode.Name;
            if ((parentName == "script") || (parentName == "style"))
                break;

            // get text
            html = ((HtmlTextNode)node).Text;

            // is it in fact a special closing node output as text?
            if (HtmlNode.IsOverlappedClosingElement(html))
                break;

            // check the text is meaningful and not a bunch of whitespaces
            if (html.Trim().Length > 0)
            {
                outText.Write(HtmlEntity.DeEntitize(html) + " ");
            }
            break;

        case HtmlAgilityPack.HtmlNodeType.Element:
            switch (node.Name)
            {
                case "p":
                    // treat paragraphs as crlf
                    outText.Write("\r\n");
                    break;
            }

            if (node.HasChildNodes)
            {
            foreach (HtmlNode subnode in node.ChildNodes)
             {
              ConvertTo(subnode, outText);
             }
            }
            break;
    }
}

现在在某些情况下，当 html 页面格式错误时（例如以下页面 - http://rareseeds.com/cart/products/Purple_of_Romagna_Artichoke-646-72.html 具有格式错误的元标记，例如 <meta content="text /html; charset=uft-8" http-equiv="Content-Type">) [注意“uft”而不是 utf] 当我尝试加载 html 时，我的代码正在呕吐文档。

有人可以建议我如何克服这些格式错误的 html 页面并仍然从 html 文档中提取相关文本吗？

谢谢，卡皮尔

原文

I want to extract meaningful text out of an html document and I was using html-agility-pack for the same. Here is my code:

string convertedContent = HttpUtility.HtmlDecode(
    ConvertHtml(HtmlAgilityPack.HtmlEntity.DeEntitize(htmlAsString))
);

ConvertHtml:

public string ConvertHtml(string html)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    StringWriter sw = new StringWriter();
    ConvertTo(doc.DocumentNode, sw);
    sw.Flush();
    return sw.ToString();
}

ConvertTo:

public void ConvertTo(HtmlAgilityPack.HtmlNode node, TextWriter outText)
{
    string html;
    switch (node.NodeType)
    {
        case HtmlAgilityPack.HtmlNodeType.Comment:
            // don't output comments
            break;

        case HtmlAgilityPack.HtmlNodeType.Document:
            foreach (HtmlNode subnode in node.ChildNodes)
            {
              ConvertTo(subnode, outText);
            }
            break;

        case HtmlAgilityPack.HtmlNodeType.Text:
            // script and style must not be output
            string parentName = node.ParentNode.Name;
            if ((parentName == "script") || (parentName == "style"))
                break;

            // get text
            html = ((HtmlTextNode)node).Text;

            // is it in fact a special closing node output as text?
            if (HtmlNode.IsOverlappedClosingElement(html))
                break;

            // check the text is meaningful and not a bunch of whitespaces
            if (html.Trim().Length > 0)
            {
                outText.Write(HtmlEntity.DeEntitize(html) + " ");
            }
            break;

        case HtmlAgilityPack.HtmlNodeType.Element:
            switch (node.Name)
            {
                case "p":
                    // treat paragraphs as crlf
                    outText.Write("\r\n");
                    break;
            }

            if (node.HasChildNodes)
            {
            foreach (HtmlNode subnode in node.ChildNodes)
             {
              ConvertTo(subnode, outText);
             }
            }
            break;
    }
}

Now in some cases when the html pages are malformed (for example the following page - http://rareseeds.com/cart/products/Purple_of_Romagna_Artichoke-646-72.html has a malformed meta-tag like <meta content="text/html; charset=uft-8" http-equiv="Content-Type">) [Note "uft" instead of utf] my code is puking at the time I am trying to load the html document.

Can someone suggest me how can I overcome these malformed html pages and still extract relevant text out of a html document?

Thanks,
Kapil

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

在梵高的星空下 2024-09-10 13:45:17

正如 HtmlAgilityPack 项目页面中所说，“解析器对‘现实世界’格式错误的 HTML 非常宽容”。但你描述的这种错误太严重了，可能无法纠正。您可以使用以下命令设置默认编码：

 HtmlDocument doc = new HtmlDocument();
 doc.OptionDefaultStreamEncoding = Encoding.UTF8;

As it is said in the HtmlAgilityPack project page "The parser is very tolerant with 'real world' malformed HTML". But the kind of error you describe is too serious maybe to be corrected. You can set the default encoding with:

 HtmlDocument doc = new HtmlDocument();
 doc.OptionDefaultStreamEncoding = Encoding.UTF8;

回复收藏 0 原文

~没有更多了~

关于作者

眼前雾蒙蒙

暂无简介

文章

26 人气

关注发私信

西西弗的石头怪

文章 0 评论 0

关注

5397313

文章 0 评论 0

关注

烟沫凡尘

文章 0 评论 0

关注

一个破名字

文章 0 评论 0

关注

萌︼了一个春

文章 0 评论 0

关注

当爱已成负担

文章 0 评论 0

友情链接

文江博客

HtmlAgilityPack 给出了格式错误的 html 问题

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（1）

关于作者

相关话题

热门标签

推荐作者

西西弗的石头怪

5397313

烟沫凡尘

一个破名字

萌︼了一个春

当爱已成负担

友情链接

HtmlAgilityPack 给出了格式错误的 html 问题

如果你对这篇内容有疑问，欢迎到本站社区发帖提问 参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（1）

关于作者

相关话题

热门标签

推荐作者

西西弗的石头怪

5397313

烟沫凡尘

一个破名字

萌︼了一个春

当爱已成负担

友情链接

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。