Grab all text from HTML with Html Agility Pack

Posted on 2024-10-02 15:46:21 · 323 characters · 5 views · 0 comments

Input

<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>

Output

foo
bar
baz

I know about htmldoc.DocumentNode.InnerText, but it returns all the text at once (foobarbaz). I want to get each piece of text separately.

Comments (9)

追我者格杀勿论 2024-10-09 15:46:22

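XPath is your friend :)

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(@"<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>");

// "//text()" selects every text node in the document.
// SelectNodes returns null when nothing matches, so guard against that.
HtmlNodeCollection textNodes = doc.DocumentNode.SelectNodes("//text()");
if (textNodes != null)
{
    foreach (HtmlNode node in textNodes)
    {
        Console.WriteLine("text=" + node.InnerText);
    }
}
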
冧九 2024-10-09 15:46:22
var root = doc.DocumentNode;
var sb = new StringBuilder();
foreach (var node in root.DescendantNodesAndSelf())
{
    if (!node.HasChildNodes)
    {
        string text = node.InnerText;
        if (!string.IsNullOrWhiteSpace(text)) // skip whitespace-only nodes so no blank lines are appended
            sb.AppendLine(text.Trim());
    }
}

This does what you need, but I am not sure if this is the best way. Maybe you should iterate through something other than DescendantNodesAndSelf for optimal performance.
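
One way to follow up on the performance remark above is to keep only text nodes instead of testing every node for children. The following is a minimal sketch, not a measured optimization: the class name TextSegments and the method name Extract are hypothetical, and it assumes the HtmlAgilityPack package is referenced. It returns each non-blank text node as its own trimmed string.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TextSegments
{
    // Hypothetical helper: returns each non-blank text node as its own trimmed string.
    public static List<string> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants()                                          // every node below the document root
            .Where(n => n.NodeType == HtmlNodeType.Text)            // keep text nodes only
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim()) // decode entities, trim edges
            .Where(t => t.Length > 0)                               // drop whitespace-only nodes
            .ToList();
    }
}
```

For the HTML in the question, Extract should return the three strings foo, bar and baz. Whether this is actually faster than the loop above depends on the document, so it is worth measuring before relying on it.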

π浅易 2024-10-09 15:46:22

I needed a solution that extracts all text but discards the content of script and style tags. I could not find one anywhere, so I came up with the following, which suits my needs:

// Requires: using System.Linq; using System.Collections.Generic; using HtmlAgilityPack;
// Select text nodes whose direct parent is neither <script> nor <style>.
IEnumerable<HtmlNode> nodes = doc.DocumentNode.Descendants().Where( n => 
    n.NodeType == HtmlNodeType.Text &&
    n.ParentNode.Name != "script" &&
    n.ParentNode.Name != "style");
foreach (HtmlNode node in nodes) {
    Console.WriteLine(node.InnerText);
}

柠檬色的秋千 2024-10-09 15:46:22
var pageContent = "{html content goes here}";
var pageDoc = new HtmlDocument();
pageDoc.LoadHtml(pageContent);
var pageText = pageDoc.DocumentNode.InnerText;

For this example HTML content:

<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>

the code produces the following output:

foo bar baz

零時差 2024-10-09 15:46:22
public string html2text(string html) {
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(@"<html><body>" + html + "</body></html>");
    return doc.DocumentNode.SelectSingleNode("//body").InnerText;
}

This workaround is based on Html Agility Pack. You can also install it via NuGet (package name: HtmlAgilityPack).

好听的两个字的网名 2024-10-09 15:46:22

https://github.com/jamietre/CsQuery

Have you tried CsQuery? Although it is no longer actively maintained, it is still my favorite for converting HTML to text. Here is a one-liner that shows how simple it is to get the text from HTML.

var text = CQ.CreateDocument(htmlText).Text();

Here's a complete console application:

using System;
using CsQuery;

public class Program
{
    public static void Main()
    {
        var html = "<div><h1>Hello World <p> some text inside h1 tag under p tag </p> </h1></div>";
        var text = CQ.CreateDocument(html).Text();
        Console.WriteLine(text); // Output: Hello World  some text inside h1 tag under p tag

    }
}

I understand that the OP asked about HtmlAgilityPack only, but CsQuery is a lesser-known alternative and one of the best solutions I have found, so I wanted to share it in case someone finds it helpful. Cheers!

浅忆流年 2024-10-09 15:46:22

I combined and fixed some of the other answers so that they work better:

var document = new HtmlDocument();
document.LoadHtml(result);
var sb = new StringBuilder();
foreach (var node in document.DocumentNode.DescendantsAndSelf())
{
    // Keep only leaf text nodes that are not inside <script> or <style>.
    if (!node.HasChildNodes && node.Name == "#text" && node.ParentNode.Name != "script" && node.ParentNode.Name != "style")
    {
        string text = node.InnerText?.Trim();
        // string.IsNullOrEmpty stands in for the original HasValue() call, which is not a built-in string method.
        if (!string.IsNullOrEmpty(text) && !text.StartsWith("<", StringComparison.Ordinal) && !text.EndsWith(">", StringComparison.Ordinal))
            sb.AppendLine(System.Web.HttpUtility.HtmlDecode(text));
    }
}

痴骨ら 2024-10-09 15:46:22

Possibly something like the code below. I found a very basic version while googling and extended it to handle hyperlinks, ul, ol, div, and table elements.

    using System.Collections.Generic;
    using System.IO;
    using HtmlAgilityPack;

    /// <summary>
    /// Static class that provides functions to convert HTML to plain text.
    /// </summary>
    public static class HtmlToText {

        #region Method: ConvertFromFile (public - static)
        /// <summary>
        /// Converts the HTML content from a given file path to plain text.
        /// </summary>
        /// <param name="path">The path to the HTML file.</param>
        /// <returns>The plain text version of the HTML content.</returns>
        public static string ConvertFromFile(string path) {
            var doc = new HtmlDocument();

            // Load the HTML file
            doc.Load(path);

            using (var sw = new StringWriter()) {
                // Convert the HTML document to plain text
                ConvertTo(node: doc.DocumentNode,
                          outText: sw,
                          counters: new Dictionary<HtmlNode, int>());
                sw.Flush();
                return sw.ToString();
            }
        }
        #endregion

        #region Method: ConvertFromString (public - static)
        /// <summary>
        /// Converts the given HTML string to plain text.
        /// </summary>
        /// <param name="html">The HTML content as a string.</param>
        /// <returns>The plain text version of the HTML content.</returns>
        public static string ConvertFromString(string html) {
            var doc = new HtmlDocument();

            // Load the HTML string
            doc.LoadHtml(html);

            using (var sw = new StringWriter()) {
                // Convert the HTML string to plain text
                ConvertTo(node: doc.DocumentNode,
                          outText: sw,
                          counters: new Dictionary<HtmlNode, int>());
                sw.Flush();
                return sw.ToString();
            }
        }
        #endregion

        #region Method: ConvertContentTo (private - static)
        /// <summary>
        /// Helper method to convert each child node of the given node to text.
        /// </summary>
        /// <param name="node">The HTML node to convert.</param>
        /// <param name="outText">The writer to output the text to.</param>
        /// <param name="counters">Keep track of the ol/li counters during conversion</param>
        private static void ConvertContentTo(HtmlNode node, TextWriter outText, Dictionary<HtmlNode, int> counters) {
            // Convert each child node to text
            foreach (var subnode in node.ChildNodes) {
                ConvertTo(subnode, outText, counters);
            }
        }
        #endregion

        #region Method: ConvertTo (public - static)
        /// <summary>
        /// Converts the given HTML node to plain text.
        /// </summary>
        /// <param name="node">The HTML node to convert.</param>
        /// <param name="outText">The writer to output the text to.</param>
        /// <param name="counters">Keep track of the ol/li counters during conversion</param>
        public static void ConvertTo(HtmlNode node, TextWriter outText, Dictionary<HtmlNode, int> counters) {
            string html;

            switch (node.NodeType) {
                case HtmlNodeType.Comment:
                    // Don't output comments
                    break;
                case HtmlNodeType.Document:
                    // Convert entire content of document node to text
                    ConvertContentTo(node, outText, counters);
                    break;
                case HtmlNodeType.Text:
                    // Ignore script and style nodes
                    var parentName = node.ParentNode.Name;
                    if ((parentName == "script") || (parentName == "style")) {
                        break;
                    }

                    // Get text from the text node
                    html = ((HtmlTextNode)node).Text;

                    // Ignore special closing nodes output as text
                    if (HtmlNode.IsOverlappedClosingElement(html) || string.IsNullOrWhiteSpace(html)) {
                        break;
                    }

                    // Write meaningful text (not just white-spaces) to the output
                    outText.Write(HtmlEntity.DeEntitize(html));
                    break;
                case HtmlNodeType.Element:
                    switch (node.Name.ToLowerInvariant()) {
                        case "p":
                        case "div":
                        case "br":
                        case "table":
                            // Start paragraphs, divs, line breaks and tables on a new line
                            outText.Write("\n");
                            break;
                        case "li":
                            // Number the items of ordered lists; prefix all other list items with a dash
                            if (node.ParentNode.Name == "ol") {
                                if (!counters.ContainsKey(node.ParentNode)) {
                                    counters[node.ParentNode] = 0;
                                }
                                counters[node.ParentNode]++;
                                outText.Write("\n" + counters[node.ParentNode] + ". ");
                            } else {
                                outText.Write("\n- ");
                            }
                            break;
                        case "a":
                            // convert hyperlinks to include the URL in parenthesis
                            if (node.HasChildNodes) {
                                ConvertContentTo(node, outText, counters);
                            }
                            if (node.Attributes["href"] != null) {
                                outText.Write($" ({node.Attributes["href"].Value})");
                            }
                            break;
                        case "th":
                        case "td":
                            outText.Write(" | ");
                            break;
                    }

                    // Convert child nodes to text if they exist (ignore a href children as they are already handled)
                    if (node.Name.ToLowerInvariant() != "a" && node.HasChildNodes) {
                        ConvertContentTo(node: node,
                                         outText: outText,
                                         counters: counters);
                    }
                    break;
            }
        }
        #endregion

    } // class: HtmlToText

兲鉂ぱ嘚淚 2024-10-09 15:46:22
string Body = htmlDocument.DocumentNode.SelectSingleNode("//body").InnerText;

Then you need to clean up the text and remove excessive whitespace and so on.
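
The cleanup step mentioned above could look roughly like this. It is a minimal sketch that uses only the .NET base class library; the class name TextCleanup and the method name Normalize are hypothetical, and collapsing every whitespace run into a single space is an assumption about what "clean up" should mean here.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class TextCleanup
{
    // Hypothetical helper: decodes HTML entities and collapses whitespace runs.
    public static string Normalize(string rawText)
    {
        if (string.IsNullOrEmpty(rawText))
            return string.Empty;

        // Turn entities such as &amp; back into plain characters.
        string decoded = WebUtility.HtmlDecode(rawText);

        // Replace every run of whitespace (spaces, tabs, newlines) with one space.
        string collapsed = Regex.Replace(decoded, @"\s+", " ");

        // Remove leading and trailing whitespace.
        return collapsed.Trim();
    }
}
```

Calling TextCleanup.Normalize(Body) on the InnerText value from the line above would give a single-line string. If line breaks between paragraphs matter, this approach is too aggressive and the whitespace pattern would need adjusting.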
