Trouble extracting text from an HTML fragment
I am using the HTML Agility Pack to convert
<font size="1">This is a test</font>
to
This is a test
using this code:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
string stripped = doc.DocumentNode.InnerText;
but I ran into an issue with this input:
<font size="1">This is a test &amp; this is a joke</font>
The code above converted it to
This is a test &amp; this is a joke
but I wanted:
This is a test & this is a joke
Does the HTML Agility Pack support what I am trying to do? Why doesn't it do this by default, or am I doing something wrong?
You can run HttpUtility.HtmlDecode() on the output. However, note that InnerText will include any HTML tags contained inside the outermost tag. If you want to remove all tags, you will have to walk the document tree and retrieve the text piece by piece.
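As a rough sketch of the two suggestions in this answer (decode the entities, and walk the tree to gather text), something like the following might work. The helper name ExtractText and the sample input are made up for illustration. It uses WebUtility.HtmlDecode from System.Net in place of HttpUtility.HtmlDecode so that no System.Web reference is needed; that swap is an assumption about your project, and either call should decode &amp; the same way. Note that text nodes inside script or style elements would also be collected by this approach.

```csharp
using System;
using System.Linq;
using System.Net;
using HtmlAgilityPack;

class Program
{
    // Hypothetical helper: gathers every text node in the fragment,
    // joins them, and decodes HTML entities once at the end.
    public static string ExtractText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Walk the whole tree and keep only text nodes, so no markup survives.
        var parts = doc.DocumentNode
            .DescendantsAndSelf()
            .Where(n => n.NodeType == HtmlNodeType.Text)
            .Select(n => n.InnerText);

        // Turn entities such as &amp; back into their literal characters.
        return WebUtility.HtmlDecode(string.Concat(parts));
    }

    static void Main()
    {
        // Sample input (illustrative): an entity plus a nested tag.
        string html = "<font size=\"1\">This is a test &amp; this is a <b>joke</b></font>";
        Console.WriteLine(ExtractText(html));
    }
}
```

If you only need the entity decoding and are happy with InnerText as it is, calling the decode function on doc.DocumentNode.InnerText alone should be enough.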