How to extract text from HTML

Posted on 2024-11-04 01:43:07

I need to extract all the text that is present in the <body> of an HTML document. Sample HTML input:

<html>
    <title>title</title>
    <body>
           <h1> This is a big title.</h1>
           How are doing you?
           <h3> I am fine </h3>
           <img src="abc.jpg"/>
    </body>
</html>

The output should be:

This is a big title. How are doing you? I am fine

I want to use only HtmlAgilityPack for this purpose. No regular expressions, please.

I know how to load an HtmlDocument, and that an XPath query like '//body' returns the body contents. But how do I strip the HTML markup so that I get the output shown above?

Thanks in advance :)

Comments (4)

筱武穆 2024-11-11 01:43:07

You can use the body's InnerText:

string html = @"
<html>
    <title>title</title>
    <body>
           <h1> This is a big title.</h1>
           How are doing you?
           <h3> I am fine </h3>
           <img src=""abc.jpg""/>
    </body>
</html>";

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
string text = doc.DocumentNode.SelectSingleNode("//body").InnerText;

Next, you may want to collapse spaces and new lines:

text = Regex.Replace(text, @"\s+", " ").Trim();

Note, however, that while this works in this case, markup such as hello<br>world or hello<i>world</i> will be converted by InnerText to helloworld, with the tags simply removed. This is difficult to solve in general, as display is often determined by the CSS, not just by the markup.
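
One rough way to approximate the hello<br>world case without regular expressions is to walk the node tree and emit a space only around tags that normally break the line. This is a minimal sketch, not a full solution: the VisibleText class and its list of "breaking" tags are hypothetical and incomplete, and CSS is ignored entirely.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack;

public static class VisibleText
{
    // Assumption: a small, incomplete set of tags that usually start a new line.
    private static readonly HashSet<string> Breaking =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "br", "p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "li", "tr"
        };

    public static string Extract(HtmlNode root)
    {
        var sb = new StringBuilder();
        Walk(root, sb);

        // Collapse runs of whitespace without a regex: split on any
        // whitespace, drop empty entries, and re-join with single spaces.
        string[] words = sb.ToString()
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", words);
    }

    private static void Walk(HtmlNode node, StringBuilder sb)
    {
        if (node.NodeType == HtmlNodeType.Comment) return; // skip <!-- ... -->

        if (node.NodeType == HtmlNodeType.Text)
        {
            // Decode entities such as &amp; before appending.
            sb.Append(HtmlEntity.DeEntitize(node.InnerText));
            return;
        }

        // Separate line-breaking elements from their neighbours;
        // inline elements such as <i> add no space.
        bool breaks = Breaking.Contains(node.Name);
        if (breaks) sb.Append(' ');
        foreach (HtmlNode child in node.ChildNodes) Walk(child, sb);
        if (breaks) sb.Append(' ');
    }
}
```

Calling VisibleText.Extract(doc.DocumentNode.SelectSingleNode("//body")) on the document from the answer above should keep hello<br>world as two words while leaving hello<i>world</i> joined. Text inside <script> or <style> elements is not filtered out in this sketch.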

梦回梦里 2024-11-11 01:43:07

How about using the XPath expression '//body//text()' to select all text nodes?
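
That suggestion could look roughly like the following with HtmlAgilityPack. It is a sketch under a few assumptions: the BodyText class and Extract method are hypothetical names, and joining the pieces with a single space is a simplification that also puts a space between inline fragments such as hello<i>world</i>.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class BodyText
{
    // Hypothetical helper: joins the trimmed text of every text node under <body>.
    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // By default SelectNodes returns null, not an empty collection,
        // when the expression matches nothing.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//body//text()");
        if (nodes == null) return string.Empty;

        var parts = new List<string>();
        foreach (HtmlNode node in nodes)
        {
            // Decode entities such as &amp; and trim surrounding whitespace.
            string piece = HtmlEntity.DeEntitize(node.InnerText).Trim();

            // Whitespace-only nodes (indentation between tags) are skipped.
            if (piece.Length > 0) parts.Add(piece);
        }

        return string.Join(" ", parts);
    }
}
```

For the sample input in the question, this should produce the requested single line without any regular expression. Whitespace inside a single text node (for example a line break in the middle of a paragraph) is not collapsed here, and text inside <script> or <style> elements in the body would be included as well.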

云淡月浅 2024-11-11 01:43:07

You can use NUglify, which supports text extraction from HTML:

var result = Uglify.HtmlToText("<div>  <p>This is <em>   a text    </em></p>   </div>");
Console.WriteLine(result.Code);   // prints: This is a text

Because it uses a custom HTML5 parser, it should be quite robust (especially if the document doesn't contain any errors), and it is very fast (no regular expressions involved, just a pure recursive descent parser; faster than HtmlAgilityPack and more GC-friendly).

与酒说心事 2024-11-11 01:43:07

Normally, for parsing HTML I would recommend an HTML parser; however, since you want to remove all HTML tags, a simple regex should work.
