How to extract text from HTML
I have a requirement to extract all the text that is present in the <body> of the HTML. Sample HTML input:
<html>
<title>title</title>
<body>
<h1> This is a big title.</h1>
How are doing you?
<h3> I am fine </h3>
<img src="abc.jpg"/>
</body>
</html>
The output should be:
This is a big title. How are doing you? I am fine
I want to use only HtmlAgilityPack for this purpose. No regular expressions, please.
I know how to load an HtmlDocument, and that an XPath query like '//body' gets the body contents. But how do I strip the HTML tags, as I have shown in the output above?
Thanks in advance :)
Comments (4)
You can use the body's InnerText property. Next, you may want to collapse spaces and new lines.
Note, however, that while this works in this case, markup such as hello<br>world or hello<i>world</i> will be converted by InnerText to helloworld, because the tags are simply removed. That issue is difficult to solve, as display is often determined by the CSS, not just by the markup.
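The code that accompanied this answer did not survive on this page, so the following is only a minimal sketch of the InnerText approach, assuming the HtmlAgilityPack NuGet package is referenced. The variable names are illustrative, and whitespace is collapsed with Split/Join rather than a regular expression, since the question asks to avoid regex.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Sample input taken from the question.
        const string html = @"<html>
<title>title</title>
<body>
<h1> This is a big title.</h1>
How are doing you?
<h3> I am fine </h3>
<img src=""abc.jpg""/>
</body>
</html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when nothing matches the XPath.
        HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
        string raw = body?.InnerText ?? string.Empty;

        // InnerText leaves entities such as &amp; encoded, so decode them.
        raw = HtmlEntity.DeEntitize(raw);

        // Passing a null separator array splits on any whitespace;
        // joining with one space collapses runs of spaces and new lines.
        string text = string.Join(" ",
            raw.Split((char[])null, StringSplitOptions.RemoveEmptyEntries));

        Console.WriteLine(text);
    }
}
```

As the answer warns, adjacent inline elements are still glued together, so hello<br>world would come out as helloworld with this sketch.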
How about using the XPath expression '//body//text()' to select all text nodes?
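A rough sketch of the '//body//text()' suggestion, assuming HtmlAgilityPack is referenced; the shortened input and the variable names are illustrative only. Each text node is trimmed and the non-empty pieces are joined with a single space.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Shortened version of the sample input from the question.
        const string html =
            "<html><title>title</title><body>" +
            "<h1> This is a big title.</h1>\nHow are doing you?\n" +
            "<h3> I am fine </h3><img src=\"abc.jpg\"/></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//body//text()");
        if (nodes == null)
        {
            Console.WriteLine(string.Empty);
            return;
        }

        // Trim each text node and drop the ones that were only whitespace.
        var parts = nodes
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            .Where(s => s.Length > 0);

        Console.WriteLine(string.Join(" ", parts));
    }
}
```

Because every text node becomes a separate piece, this variant puts a space between hello and world in hello<br>world, at the cost of also splitting words that are only partly wrapped in an inline tag. Text inside script or style elements under the body would be selected too and may need filtering.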
You can use NUglify, which supports text extraction from HTML.
As it uses a custom HTML5 parser, it should be quite robust (especially if the document doesn't contain any errors) and very fast (no regexp involved, just a pure recursive descent parser; faster than HtmlAgilityPack and more GC friendly).
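The snippet from this answer is missing here. A minimal sketch, assuming the NUglify NuGet package is referenced, could look like the following; the input string is illustrative.

```csharp
using System;
using NUglify;

class Program
{
    static void Main()
    {
        // Illustrative input, not taken from the answer.
        const string html = "<div>  <p>This is <em>   a text    </em></p>   </div>";

        // HtmlToText parses the markup and returns the extracted text in Code.
        var result = Uglify.HtmlToText(html);

        if (result.HasErrors)
        {
            // Parsing problems are reported on the result object.
            foreach (var error in result.Errors)
                Console.Error.WriteLine(error);
        }

        Console.WriteLine(result.Code);
    }
}
```

Note that this call takes the whole HTML string. To limit extraction to the body, one option would be to pass only the body's markup.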
Normally, for parsing HTML I would recommend an HTML parser. However, since you want to remove all HTML tags, a simple regex should work.
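The regex from this answer was not preserved, so the pattern below is an assumption rather than the answerer's own. It is a sketch of the general idea only: replace anything that looks like a tag with a space, then collapse whitespace.

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        const string html =
            "<html><title>title</title><body>" +
            "<h1> This is a big title.</h1>\nHow are doing you?\n" +
            "<h3> I am fine </h3><img src=\"abc.jpg\"/></body></html>";

        // Assumed pattern: '<', then any run of characters other than '>', then '>'.
        string noTags = Regex.Replace(html, "<[^>]*>", " ");

        // Collapse the leftover whitespace into single spaces.
        string text = string.Join(" ",
            noTags.Split((char[])null, StringSplitOptions.RemoveEmptyEntries));

        Console.WriteLine(text);
    }
}
```

This strips tags from the whole document, so the contents of <title> are kept as well, and it can misfire on comments, scripts, or attribute values that contain '>'. The question also explicitly asks to avoid regular expressions.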