How do I extract text from reasonably sane HTML?

Posted on 2024-08-18 22:03:56

My question is sort of like this question, but I have more constraints:

  • I know the documents are reasonably sane,
  • they are very regular (they all came from the same source),
  • I want about 99% of the visible text,
  • about 99% of what is visible at all is text (they are more or less RTF converted to HTML),
  • I don't care about formatting or even paragraph breaks.

Are there any tools set up to do this, or am I better off just breaking out RegexBuddy and C#?

I'm open to command-line or batch-processing tools as well as C/C#/D libraries.

Comments (10)

凌乱心跳 2024-08-25 22:03:56

This code, which I hacked up today with HTML Agility Pack, extracts unformatted, trimmed text.

using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static string ExtractText(string html)
{
    if (html == null)
    {
        throw new ArgumentNullException("html");
    }

    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    var chunks = new List<string>();

    // Walk every node in the tree and keep the non-empty text nodes.
    foreach (var item in doc.DocumentNode.DescendantNodesAndSelf())
    {
        if (item.NodeType == HtmlNodeType.Text)
        {
            string text = item.InnerText.Trim();
            if (text != "")
            {
                chunks.Add(text);
            }
        }
    }
    return String.Join(" ", chunks);
}

If you want to maintain some level of formatting, you can build on the sample provided with the source.

public string Convert(string path)
{
    HtmlDocument doc = new HtmlDocument();
    doc.Load(path);

    StringWriter sw = new StringWriter();
    ConvertTo(doc.DocumentNode, sw);
    sw.Flush();
    return sw.ToString();
}

public string ConvertHtml(string html)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    StringWriter sw = new StringWriter();
    ConvertTo(doc.DocumentNode, sw);
    sw.Flush();
    return sw.ToString();
}

public void ConvertTo(HtmlNode node, TextWriter outText)
{
    string html;
    switch (node.NodeType)
    {
        case HtmlNodeType.Comment:
            // don't output comments
            break;

        case HtmlNodeType.Document:
            ConvertContentTo(node, outText);
            break;

        case HtmlNodeType.Text:
            // script and style must not be output
            string parentName = node.ParentNode.Name;
            if ((parentName == "script") || (parentName == "style"))
                break;

            // get text
            html = ((HtmlTextNode) node).Text;

            // is it in fact a special closing node output as text?
            if (HtmlNode.IsOverlappedClosingElement(html))
                break;

            // check the text is meaningful and not a bunch of whitespaces
            if (html.Trim().Length > 0)
            {
                outText.Write(HtmlEntity.DeEntitize(html));
            }
            break;

        case HtmlNodeType.Element:
            switch (node.Name)
            {
                case "p":
                    // treat paragraphs as crlf
                    outText.Write("\r\n");
                    break;
            }

            if (node.HasChildNodes)
            {
                ConvertContentTo(node, outText);
            }
            break;
    }
}


private void ConvertContentTo(HtmlNode node, TextWriter outText)
{
    foreach (HtmlNode subnode in node.ChildNodes)
    {
        ConvertTo(subnode, outText);
    }
}

箜明 2024-08-25 22:03:56

You can use NUglify, which supports text extraction from HTML:

var result = Uglify.HtmlToText("<div>  <p>This is <em>   a text    </em></p>   </div>");
Console.WriteLine(result.Code);   // prints: This is a text

Because it uses a custom HTML5 parser, it should be quite robust (especially if the document doesn't contain any errors), and it is very fast (no regexes involved, just a pure recursive descent parser).

鲸落 2024-08-25 22:03:56

You need to use the HTML Agility Pack.

You probably want to find an element using LINQ and the Descendants call, then get its InnerText.
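A minimal sketch of that suggestion, assuming the HtmlAgilityPack NuGet package is referenced. The names BodyTextExtractor and ExtractBodyText are made up for illustration, and picking the body element is only one possible choice of element. Note that InnerText also returns whatever sits inside script and style elements, so this is only a reasonable fit for documents that are mostly plain text.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class BodyTextExtractor // hypothetical helper, not part of the library
{
    // Returns the text of the first <body> element, or of the whole
    // document when no <body> element is present.
    public static string ExtractBodyText(string html)
    {
        if (html == null)
        {
            throw new ArgumentNullException(nameof(html));
        }

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Descendants("body") enumerates matching elements; LINQ picks the first one.
        HtmlNode root = doc.DocumentNode.Descendants("body").FirstOrDefault()
                        ?? doc.DocumentNode;

        // Decode entities such as &amp; that may still be present in the text.
        return HtmlEntity.DeEntitize(root.InnerText).Trim();
    }
}
```

Calling BodyTextExtractor.ExtractBodyText(html) on a loaded document string returns the concatenated text of the body; whitespace inside the text is left as it appears in the source.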

迎风吟唱 2024-08-25 22:03:56

Here is the code I am using:

using System.Text.RegularExpressions;
using System.Web;

public static string ExtractText(string html)
{
    // Replace every tag with a space, then decode HTML entities.
    Regex reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
    string s = reg.Replace(html, " ");
    s = HttpUtility.HtmlDecode(s);
    return s;
}
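A tag-stripping regex on its own leaves the contents of script and style blocks in the output and does not collapse runs of whitespace. The following is a hedged sketch of one way to handle both using only the base class library. The class name RegexTextExtractor is hypothetical, and WebUtility.HtmlDecode is swapped in for HttpUtility.HtmlDecode so that no System.Web reference is needed. It assumes reasonably regular input, as described in the question, and is not a general HTML parser.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class RegexTextExtractor // hypothetical name, for illustration only
{
    // Whole <script>...</script> and <style>...</style> blocks, including their contents.
    private static readonly Regex ScriptOrStyle = new Regex(
        @"<(script|style)\b[^>]*>.*?</\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // HTML comments, which may span several lines.
    private static readonly Regex Comment = new Regex(@"<!--.*?-->", RegexOptions.Singleline);

    // Any remaining tag.
    private static readonly Regex Tag = new Regex(@"<[^>]+>");

    // Runs of whitespace, collapsed to one space at the end.
    private static readonly Regex Whitespace = new Regex(@"\s+");

    public static string ExtractText(string html)
    {
        if (html == null)
        {
            throw new ArgumentNullException(nameof(html));
        }

        string s = ScriptOrStyle.Replace(html, " "); // drop non-visible blocks first
        s = Comment.Replace(s, " ");
        s = Tag.Replace(s, " ");                     // a space keeps adjacent words apart
        s = WebUtility.HtmlDecode(s);                // decode entities after the tags are gone
        return Whitespace.Replace(s, " ").Trim();
    }
}
```

Removing the script and style blocks before the generic tag pass matters, because a bare "<" inside a script would otherwise confuse the tag pattern.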

┈┾☆殇 2024-08-25 22:03:56

It's relatively simple: if you load the HTML into C# and use mshtml.dll or the WebBrowser control in C#/WinForms, you can treat the entire HTML document as a tree and traverse it, capturing the InnerText of each element.

Or you could use document.all, which flattens the tree so that you can iterate over it, again capturing the InnerText.

Here's an example:

        WebBrowser webBrowser = new WebBrowser();
        webBrowser.Url = new Uri("url_of_file"); //can be remote or local
        webBrowser.DocumentCompleted += delegate
        {
            HtmlElementCollection collection = webBrowser.Document.All;
            List<string> contents = new List<string>();

            /*
             * Adds all inner-text of a tag, including inner-text of sub-tags
             * ie. <html><body><a>test</a><b>test 2</b></body></html> would do:
             * "test test 2" when collection[i] == <html>
             * "test test 2" when collection[i] == <body>
             * "test" when collection[i] == <a>
             * "test 2" when collection[i] == <b>
             */
            for (int i = 0; i < collection.Count; i++)
            {
                if (!string.IsNullOrEmpty(collection[i].InnerText))
                {
                    contents.Add(collection[i].InnerText);
                }
            }

            /*
             * <html><body><a>test</a><b>test 2</b></body></html>
             * outputs: test test 2|test test 2|test|test 2
             */
            string contentString = string.Join("|", contents.ToArray());
            MessageBox.Show(contentString);
        };

Hope that helps!

独自唱情﹋歌 2024-08-25 22:03:56

Here is the best way:

using System.Text.RegularExpressions;

public static string StripHTML(string HTMLText)
{
    // Delete everything that looks like a tag.
    Regex reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
    return reg.Replace(HTMLText, "");
}

冷心人i 2024-08-25 22:03:56

Here's a class I developed to accomplish the same thing. All the available HTML parsing libraries were far too slow, and regex was far too slow as well. The functionality is explained in the code comments. In my benchmarks, this code is a little over 10x faster than the equivalent HTML Agility Pack code when tested on Amazon's landing page (the equivalent code is included below).

/// <summary>
/// The fast HTML text extractor class is designed to, as quickly and as ignorantly as possible,
/// extract text data from a given HTML character array. The class searches for and deletes
/// script and style tags in a first and second pass, with an optional third pass to do the same
/// to HTML comments, and then copies remaining non-whitespace character data to an output array.
/// All whitespace encountered is replaced with a single space to avoid multiple
/// whitespace in the output.
///
/// Note that the returned text content still may have named character and numbered character
/// references within that, when decoded, may produce multiple whitespace.
/// </summary>
public class FastHtmlTextExtractor
{

    private readonly char[] SCRIPT_OPEN_TAG = new char[7] { '<', 's', 'c', 'r', 'i', 'p', 't' };
    private readonly char[] SCRIPT_CLOSE_TAG = new char[9] { '<', '/', 's', 'c', 'r', 'i', 'p', 't', '>' };

    private readonly char[] STYLE_OPEN_TAG = new char[6] { '<', 's', 't', 'y', 'l', 'e' };
    private readonly char[] STYLE_CLOSE_TAG = new char[8] { '<', '/', 's', 't', 'y', 'l', 'e', '>' };

    private readonly char[] COMMENT_OPEN_TAG = new char[3] { '<', '!', '-' };
    private readonly char[] COMMENT_CLOSE_TAG = new char[3] { '-', '-', '>' };

    private int[] m_deletionDictionary;

    public string Extract(char[] input, bool stripComments = false)
    {
        var len = input.Length;
        int next = 0;

        m_deletionDictionary = new int[len];

        // Wipe out all text content between style and script tags.
        FindAndWipe(SCRIPT_OPEN_TAG, SCRIPT_CLOSE_TAG, input);
        FindAndWipe(STYLE_OPEN_TAG, STYLE_CLOSE_TAG, input);

        if(stripComments)
        {
            // Wipe out everything between HTML comments.
            FindAndWipe(COMMENT_OPEN_TAG, COMMENT_CLOSE_TAG, input);
        }

        // Wipe all remaining tags (everything from '<' to '>') now.
        while(next < len)
        {
            next = SkipUntil(next, '<', input);

            if(next < len)
            {
                var closeNext = SkipUntil(next, '>', input);

                if(closeNext < len)
                {
                    m_deletionDictionary[next] = (closeNext + 1) - next;
                    WipeRange(next, closeNext + 1, input);
                }

                next = closeNext + 1;
            }
        }

        // Collect all non-whitespace and non-null chars into a new
        // char array. All whitespace characters are skipped and replaced
        // with a single space char. Multiple whitespace is ignored.
        var lastSpace = true;
        var extractedPos = 0;
        var extracted = new char[len];

        for(next = 0; next < len; ++next)
        {
            if(m_deletionDictionary[next] > 0)
            {
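                // Land on the last wiped index; the loop's ++next then steps past the range
                // without skipping the first character that follows it.
                next += m_deletionDictionary[next] - 1;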
                continue;
            }

            if(char.IsWhiteSpace(input[next]) || input[next] == '\0')
            {
                if(lastSpace)
                {
                    continue;
                }

                extracted[extractedPos++] = ' ';
                lastSpace = true;
            }
            else
            {
                lastSpace = false;
                extracted[extractedPos++] = input[next];
            }
        }

        return new string(extracted, 0, extractedPos);
    }

    /// <summary>
    /// Does a search in the input array for the characters in the supplied open and closing tag
    /// char arrays. Each match where both tag open and tag close are discovered causes the text
    /// in between the matches to be overwritten by Array.Clear().
    /// </summary>
    /// <param name="openingTag">
    /// The opening tag to search for.
    /// </param>
    /// <param name="closingTag">
    /// The closing tag to search for.
    /// </param>
    /// <param name="input">
    /// The input to search in.
    /// </param>
    private void FindAndWipe(char[] openingTag, char[] closingTag, char[] input)
    {
        int len = input.Length;
        int pos = 0;

        do
        {
            pos = FindNext(pos, openingTag, input);

            if(pos < len)
            {
                var closenext = FindNext(pos, closingTag, input);

                if(closenext < len)
                {
                    m_deletionDictionary[pos - openingTag.Length] = closenext - (pos - openingTag.Length);
                    WipeRange(pos - openingTag.Length, closenext, input);
                }

                if(closenext > pos)
                {
                    pos = closenext;
                }
                else
                {
                    ++pos;
                }
            }
        }
        while(pos < len);
    }

    /// <summary>
    /// Skips as many characters as possible within the input array until the given char is
    /// found. The position of the first instance of the char is returned, or if not found, a
    /// position beyond the end of the input array is returned.
    /// </summary>
    /// <param name="pos">
    /// The starting position to search from within the input array.
    /// </param>
    /// <param name="c">
    /// The character to find.
    /// </param>
    /// <param name="input">
    /// The input to search within.
    /// </param>
    /// <returns>
    /// The position of the found character, or an index beyond the end of the input array.
    /// </returns>
    private int SkipUntil(int pos, char c, char[] input)
    {
        if(pos >= input.Length)
        {
            return pos;
        }

        do
        {
            if(input[pos] == c)
            {
                return pos;
            }

            ++pos;
        }
        while(pos < input.Length);

        return pos;
    }

    /// <summary>
    /// Clears a given range in the input array.
    /// </summary>
    /// <param name="start">
    /// The start position from which the array will begin to be cleared.
    /// </param>
    /// <param name="end">
    /// The end position in the array, the position to clear up-until.
    /// </param>
    /// <param name="input">
    /// The source array wherein the supplied range will be cleared.
    /// </param>
    /// <remarks>
    /// Note that the second parameter is called end, not length. This parameter is meant to be
    /// a position in the array, not the amount of entries in the array to clear.
    /// </remarks>
    private void WipeRange(int start, int end, char[] input)
    {
        Array.Clear(input, start, end - start);
    }

    /// <summary>
    /// Finds the next occurrence of the supplied char array within the input array. This search
    /// ignores whitespace.
    /// </summary>
    /// <param name="pos">
    /// The position to start searching from.
    /// </param>
    /// <param name="what">
    /// The sequence of characters to find.
    /// </param>
    /// <param name="input">
    /// The input array to perform the search on.
    /// </param>
    /// <returns>
    /// The position of the end of the first matching occurrence. That is, the returned position
    /// points to the very end of the search criteria within the input array, not the start. If
    /// no match could be found, a position beyond the end of the input array will be returned.
    /// </returns>
    public int FindNext(int pos, char[] what, char[] input)
    {
        do
        {
            if(Next(ref pos, what, input))
            {
                return pos;
            }
            ++pos;
        }
        while(pos < input.Length);

        return pos;
    }

    /// <summary>
    /// Probes the input array at the given position to determine if the next N characters
    /// matches the supplied character sequence. This check ignores whitespace.
    /// </summary>
    /// <param name="pos">
    /// The position at which to check within the input array for a match to the supplied
    /// character sequence.
    /// </param>
    /// <param name="what">
    /// The character sequence to attempt to match. Note that whitespace between characters
    /// within the input array is acceptable.
    /// </param>
    /// <param name="input">
    /// The input array to check within.
    /// </param>
    /// <returns>
    /// True if the next N characters within the input array matches the supplied search
    /// character sequence. Returns false otherwise.
    /// </returns>
    public bool Next(ref int pos, char[] what, char[] input)
    {
        int z = 0;

        do
        {
            if(char.IsWhiteSpace(input[pos]) || input[pos] == '\0')
            {
                ++pos;
                continue;
            }

            if(input[pos] == what[z])
            {
                ++z;
                ++pos;
                continue;
            }

            return false;
        }
        while(pos < input.Length && z < what.Length);

        return z == what.Length;
    }
}

Equivalent in HtmlAgilityPack:

// Where m_whitespaceRegex is a Regex with [\s].
// Where sampleHtmlText is a raw HTML string.

var extractedSampleText = new StringBuilder();
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(sampleHtmlText);

if(doc != null && doc.DocumentNode != null)
{
    foreach(var script in doc.DocumentNode.Descendants("script").ToArray())
    {
        script.Remove();
    }

    foreach(var style in doc.DocumentNode.Descendants("style").ToArray())
    {
        style.Remove();
    }

    var allTextNodes = doc.DocumentNode.SelectNodes("//text()");
    if(allTextNodes != null && allTextNodes.Count > 0)
    {
        foreach(HtmlNode node in allTextNodes)
        {
            extractedSampleText.Append(node.InnerText);
        }
    }

    var finalText = m_whitespaceRegex.Replace(extractedSampleText.ToString(), " ");
}

一影成城 2024-08-25 22:03:56

Here you can download a tool, along with its source, that converts between HTML and XAML: XAML/HTML converter.

It contains an HTML parser (which obviously has to be much more tolerant than a standard XML parser), and you can traverse the HTML much as you would XML.

巴黎夜雨 2024-08-25 22:03:56

From the command line, you can use the Lynx text browser like this:

If you want to download a web page in formatted output (i.e., without HTML tags, but instead as it would appear in Lynx), then enter:

lynx -dump URL > filename

If there are any links on the page, the URLs for those links will be included at the end of the downloaded page.

You can disable the list of links with -nolist. For example:

lynx -dump -nolist http://stackoverflow.com/a/10469619/724176 > filename

一个人的夜不怕黑 2024-08-25 22:03:56

Try the following code:

string? GetBodyPreview(string? htmlBody)
{
    Regex reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
    // Keep at most 1000 characters starting at the <body> tag, then strip all tags.
    htmlBody = reg.Replace(Crop(htmlBody, "<body ", 1000), "");
    // Decode entities and keep at most 255 characters for the preview.
    return Crop(HttpUtility.HtmlDecode(htmlBody), "", 255);

    string Crop(string? text, string start, int maxLength)
    {
        var s = text?.IndexOf(start);
        var r = (s >= 0 ? text?.Substring(s.Value) : text) ?? string.Empty;
        // Math.Min is available on all .NET versions, unlike Int32.Min.
        return r.Substring(0, Math.Min(r.Length, maxLength)).TrimStart();
    }
}