How to extract text from HTML

Posted on 2024-11-04 01:43:07

I need to extract all the text present in the <body> of an HTML document. Sample HTML input:

<html>
    <title>title</title>
    <body>
           <h1> This is a big title.</h1>
           How are doing you?
           <h3> I am fine </h3>
           <img src="abc.jpg"/>
    </body>
</html>

The output should be:

This is a big title. How are doing you? I am fine

I want to use only HtmlAgilityPack for this. No regular expressions, please.

I know how to load an HtmlDocument, and that an XPath query like '//body' returns the body contents. But how do I strip the HTML, as shown in the output above?

Thanks in advance :)

筱武穆 2024-11-11 01:43:07

You can use the body's InnerText:

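using HtmlAgilityPack;

string html = @"
<html>
    <title>title</title>
    <body>
           <h1> This is a big title.</h1>
           How are doing you?
           <h3> I am fine </h3>
           <img src=""abc.jpg""/>
    </body>
</html>";

// Parse the markup, then read the text content of the <body> element.
// SelectSingleNode returns null if the document has no <body>.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
string text = doc.DocumentNode.SelectSingleNode("//body").InnerText;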

Next, you may want to collapse runs of spaces and newlines:

// Requires: using System.Text.RegularExpressions;
text = Regex.Replace(text, @"\s+", " ").Trim();
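
Since the question asks for no regular expressions, the same collapsing step can probably be done with plain string methods. The following is only a sketch: the class and method names (WhitespaceCollapser, Collapse) are made up for illustration and are not part of HtmlAgilityPack.

```csharp
using System;

public static class WhitespaceCollapser
{
    // Hypothetical helper: replaces every run of whitespace with a single
    // space and drops leading and trailing whitespace.
    public static string Collapse(string text)
    {
        // An empty separator array tells Split to break on whitespace;
        // RemoveEmptyEntries discards the empty strings that consecutive
        // whitespace characters would otherwise produce.
        string[] words = text.Split(Array.Empty<char>(), StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", words);
    }
}
```

Calling WhitespaceCollapser.Collapse(text) on the InnerText value would then take the place of the Regex.Replace call.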

Note, however, that while this works here, markup such as hello<br>world or hello<i>world</i> will be converted by InnerText to helloworld, because the tags are simply removed. That is hard to fix in general, because how text is displayed is often determined by CSS, not just by the markup.

梦回梦里 2024-11-11 01:43:07

How about selecting all the text nodes with the XPath expression '//body//text()'?
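
A minimal sketch of how that suggestion might look with HtmlAgilityPack, assuming the text nodes are joined with single spaces and no regular expressions are used. The class and method names (BodyTextExtractor, Extract) are invented for this example.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class BodyTextExtractor
{
    // Hypothetical helper: collects the text nodes under <body> and joins
    // their words with single spaces.
    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//body//text()");
        if (nodes == null)
        {
            return string.Empty;
        }

        var words = new List<string>();
        foreach (HtmlNode node in nodes)
        {
            // Decode entities such as &amp;, then split on whitespace so that
            // whitespace-only nodes contribute nothing.
            string decoded = HtmlEntity.DeEntitize(node.InnerText);
            words.AddRange(decoded.Split(Array.Empty<char>(), StringSplitOptions.RemoveEmptyEntries));
        }
        return string.Join(" ", words);
    }
}
```

Because each text node is handled separately, hello<br>world comes out as "hello world" rather than "helloworld". The trade-off is that a word split by inline tags, such as hel<b>lo</b>, also gets a space inserted, and any text inside <script> or <style> elements within the body would be included unless those nodes are filtered out first.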

云淡月浅 2024-11-11 01:43:07

You can use NUglify, which supports extracting text from HTML:

// Requires: using NUglify;
var result = Uglify.HtmlToText("<div>  <p>This is <em>   a text    </em></p>   </div>");
Console.WriteLine(result.Code);   // prints: This is a text

Because it uses a custom HTML5 parser, it should be quite robust (especially if the document contains no errors) and very fast: no regular expressions are involved, just a pure recursive-descent parser that is faster than HtmlAgilityPack and more GC-friendly.
