Grab all text from HTML with Html Agility Pack
Input
<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>
Output
foo
bar
baz
I know of htmldoc.DocumentNode.InnerText, but it will give foobarbaz. I want to get each piece of text separately, not all at once.
Comments (9)
XPATH is your friend :)
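The code that went with this answer did not survive on this page, so here is a minimal sketch of the XPath idea, assuming the HtmlAgilityPack NuGet package is referenced. The class name XPathTextDemo and the method GetTextNodes are made up for illustration; the //text() expression selects every text node in the document.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper class, not part of Html Agility Pack.
public static class XPathTextDemo
{
    // Returns the trimmed content of every non-blank text node, in document order.
    public static List<string> GetTextNodes(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//text()");
        if (nodes == null) return result;

        foreach (HtmlNode node in nodes)
        {
            string text = node.InnerText.Trim();
            if (text.Length > 0) result.Add(text);
        }
        return result;
    }

    public static void Main()
    {
        string html = "<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>";
        foreach (string text in GetTextNodes(html))
            Console.WriteLine(text);
    }
}
```

For the input in the question this should print foo, bar and baz on separate lines, since each of the three text nodes is handled on its own.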
This does what you need, but I am not sure if this is the best way. Maybe you should iterate through something other than DescendantNodesAndSelf for optimal performance.
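The original snippet is missing here, so the following is only a sketch of what iterating DescendantNodesAndSelf could look like, assuming the HtmlAgilityPack NuGet package. DescendantTextDemo and GetTextNodes are illustrative names. Filtering on HtmlNodeType.Text keeps the leaf text nodes and skips elements and comments.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper class, not part of Html Agility Pack.
public static class DescendantTextDemo
{
    // Lazily yields the trimmed content of each non-blank text node.
    public static IEnumerable<string> GetTextNodes(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        foreach (HtmlNode node in doc.DocumentNode.DescendantNodesAndSelf())
        {
            // Elements, comments and the document node itself are skipped.
            if (node.NodeType != HtmlNodeType.Text) continue;

            string text = node.InnerText.Trim();
            if (text.Length > 0) yield return text;
        }
    }

    public static void Main()
    {
        string html = "<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>";
        foreach (string text in GetTextNodes(html))
            Console.WriteLine(text);
    }
}
```

Because the method uses yield return, nothing is parsed until the result is enumerated, which may matter if you only need the first few pieces of text.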
I needed a solution that extracts all text but discards the content of script and style tags. I could not find one anywhere, so I came up with the following, which suits my own needs:
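The author's code is not preserved on this page. As a rough sketch of the same idea (not the author's actual implementation), one could filter text nodes by their parent element, assuming the HtmlAgilityPack NuGet package. VisibleTextDemo, GetVisibleText and the Skipped set are illustrative names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper class, not part of Html Agility Pack.
public static class VisibleTextDemo
{
    // Elements whose text content should be ignored (parsed element names are lowercase).
    private static readonly HashSet<string> Skipped = new HashSet<string> { "script", "style" };

    public static List<string> GetVisibleText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode.Descendants()
            // Keep text nodes only.
            .Where(n => n.NodeType == HtmlNodeType.Text)
            // Drop text that sits directly inside <script> or <style>.
            .Where(n => n.ParentNode == null || !Skipped.Contains(n.ParentNode.Name))
            // Decode entities such as &amp; and trim surrounding whitespace.
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            .Where(t => t.Length > 0)
            .ToList();
    }

    public static void Main()
    {
        string html = "<html><head><style>p{color:red}</style><script>var x = 1;</script></head>"
                    + "<body><p>foo <a href='#'>bar</a> baz</p></body></html>";
        foreach (string text in GetVisibleText(html))
            Console.WriteLine(text);
    }
}
```

Adding more names to the Skipped set (for example noscript) is one way to extend the filter if other elements turn out to carry text you do not want.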
For the HTML content of the specified example:
the code will produce the following output:
This workaround is based on Html Agility Pack. You can also install it via NuGet (package name: HtmlAgilityPack).
Have you tried CsQuery? Though it is not actively maintained, it is still my favorite for parsing HTML to text. Here's a one-liner showing how simple it is to get the text from HTML.
Here's a complete console application:
I understand that the OP asked for HtmlAgilityPack only, but CsQuery is a lesser-known option and one of the best solutions I've found, so I wanted to share it in case someone finds it helpful. Cheers!
I just changed and fixed some people's answers to work better:
Possibly something like the code below (I found a very basic version while googling and extended it to handle hyperlinks, ul, ol, divs, and tables):
Then you need to clean up the text and remove excessive whitespace and so on.
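The cleanup step is not spelled out, so here is one possible sketch using only System.Text.RegularExpressions. WhitespaceCleanup and Clean are illustrative names, and the rules (collapse spaces and tabs, keep at most one blank line) are assumptions about what "excessive whitespace" means here.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class for post-processing extracted text.
public static class WhitespaceCleanup
{
    public static string Clean(string text)
    {
        // Normalize Windows and old Mac line endings to "\n".
        text = text.Replace("\r\n", "\n").Replace('\r', '\n');
        // Collapse runs of spaces and tabs into a single space.
        text = Regex.Replace(text, @"[ \t]+", " ");
        // Remove a space directly before or after a line break.
        text = Regex.Replace(text, @" ?\n ?", "\n");
        // Collapse three or more line breaks into one blank line.
        text = Regex.Replace(text, @"\n{3,}", "\n\n");
        return text.Trim();
    }

    public static void Main()
    {
        Console.WriteLine(Clean("  foo   \t bar \r\n\r\n\r\n\r\n baz  "));
    }
}
```

Non-breaking spaces and other Unicode whitespace are left alone in this sketch; whether to fold them in depends on the pages being scraped.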