HtmlAgilityPack and HtmlDecode
I am currently using HtmlAgilityPack in a console application to scrape a website. Since the HTML is encoded (it returns encoded characters like &#39;), I have to decode it before I save the content to my database.
Is there a way to decode the returned HTML using HtmlAgilityPack without having to use HttpUtility.HtmlDecode? I want to avoid adding System.Web to my console application if possible.
Comments (3)
The Html Agility Pack is equipped with a utility class called HtmlEntity. It has a static method with the following signature:
public static string DeEntitize(string text)
It supports well-known entities (like &nbsp;) and encoded characters such as &#39; as well.
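A minimal sketch of how the DeEntitize method of HtmlEntity might be applied to scraped text. It assumes the HtmlAgilityPack NuGet package is referenced; the class name DeEntitizeDemo, the helper InnerTextDecoded, and the sample markup are hypothetical and only there for illustration.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class DeEntitizeDemo
{
    // Hypothetical helper: returns the decoded text of the first node
    // matching the XPath expression, or null if nothing matches.
    public static string InnerTextDecoded(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        if (node == null)
        {
            return null;
        }

        // InnerText still contains entities such as &#39; and &amp;,
        // so they are decoded here before the text is stored.
        return HtmlEntity.DeEntitize(node.InnerText);
    }

    public static void Main()
    {
        // Illustrative input only, not taken from a real page.
        string html = "<p>Tom&#39;s &amp; Jerry</p>";
        Console.WriteLine(InnerTextDecoded(html, "//p"));
    }
}
```

Decoding at the point where the text leaves the parsed document keeps System.Web out of the console application, which is what the question asks for.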
Just adding my 2 cents: I ran some performance tests using the Stopwatch class and found that HttpUtility.HtmlDecode is about 15-20% faster than the DeEntitize method. Also, DeEntitize has some bugs (see the comments above). So maybe referencing System.Web is not that bad after all.
If you're writing an app that already targets the full .NET Framework (as opposed to the lightweight .NET Client Profile), I'd go for referencing System.Web.
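For anyone who wants to repeat that kind of comparison, here is a rough sketch of a Stopwatch-based timing loop. It is not the test the answer above ran; the class name DecodeTiming, the helper Time, the sample string, and the iteration count are all assumptions, and it presumes both System.Web and the HtmlAgilityPack package are referenced. Results will vary with input and runtime.

```csharp
using System;
using System.Diagnostics;
using System.Web;      // HttpUtility (assumption: System.Web is referenced)
using HtmlAgilityPack; // HtmlEntity (assumption: NuGet package is installed)

public static class DecodeTiming
{
    // Hypothetical helper: runs the decoder `iterations` times and
    // returns the elapsed wall-clock time.
    public static TimeSpan Time(Func<string, string> decode, string input, int iterations)
    {
        decode(input); // warm-up call so JIT compilation is not measured

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            decode(input);
        }
        sw.Stop();
        return sw.Elapsed;
    }

    public static void Main()
    {
        // Illustrative sample and iteration count, chosen arbitrarily.
        const string sample = "Tom&#39;s &amp; Jerry &lt;b&gt;bold&lt;/b&gt;";
        const int iterations = 1000000;

        Console.WriteLine("HttpUtility.HtmlDecode: " + Time(HttpUtility.HtmlDecode, sample, iterations));
        Console.WriteLine("HtmlEntity.DeEntitize:  " + Time(HtmlEntity.DeEntitize, sample, iterations));
    }
}
```

A single loop like this is only a coarse measurement; running it several times on text that resembles the scraped pages gives a more trustworthy picture.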
Use WebUtility (System.Net.WebUtility.HtmlDecode), which doesn't need any special reference.
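A short sketch of the WebUtility route, assuming .NET Framework 4.0 or later, where System.Net.WebUtility is available without referencing System.Web. The class name DecodeDemo, the helper Decode, and the sample string are hypothetical.

```csharp
using System;
using System.Net; // WebUtility lives here, no System.Web reference needed

public static class DecodeDemo
{
    // Hypothetical helper: simply forwards to WebUtility.HtmlDecode.
    public static string Decode(string scraped)
    {
        return WebUtility.HtmlDecode(scraped);
    }

    public static void Main()
    {
        // Illustrative input only.
        Console.WriteLine(Decode("It&#39;s &lt;b&gt;bold&lt;/b&gt; &amp; done"));
        // prints: It's <b>bold</b> & done
    }
}
```

This can be combined with HtmlAgilityPack by passing a node's InnerText to the helper before saving it.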