How do I read an HTML document in C#, given that I have the web page source stored in a string variable?
I have tried to do this on my own but couldn't.
I have an HTML document, and I'm trying to extract the addresses of all the pictures in it into a C# collection, but I'm not sure of the syntax. I'm using HtmlAgilityPack. Here is what I have so far. Please advise.
The HTML code is the following:
<div style='padding-left:12px;' id='myWeb123'>
<b>MyWebSite Pics</b>
<br /><br />
<img src="http://myWebSite.com/pics/HHTR_01.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_02.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_03.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_04.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_05.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_06.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_07.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_08.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_09.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_10.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<a href="http://www.myWebSite.com/" target="_blank" rel="nofollow">Source</a>
</div>
And the C# code is the following:
HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
document.Load("FileName.html");
// Targets a specific node
HtmlNode someNode = document.GetElementbyId("myWeb123");
//HtmlNodeCollection linkNodes = document.DocumentNode.SelectNodes("//a[@href]");
HtmlNodeCollection linkNodes = document.DocumentNode.SelectNodes("//div[@id='myWeb123']");
if (linkNodes != null)
{
    int count = 0;
    foreach (HtmlNode linkNode in linkNodes)
    {
        string linkTitle = linkNode.GetAttributeValue("src", string.Empty);
        Debug.Print("linkTitle = " + linkTitle);
        if (linkTitle == string.Empty)
        {
            HtmlNode imageNode = linkNode.SelectSingleNode("img[@alt]");
            if (imageNode != null)
            {
                Debug.Print("imageNode = " + imageNode.Attributes.ToString());
            }
        }
        count++;
        Debug.Print("count = " + count);
    }
}
I tried to use the HtmlAgilityPack documentation, but it lacks examples, and the information about its methods and classes is really hard for me to understand without them.
Comments (4)
Try this. Sorry if it doesn't build as-is; I adapted our own code to your situation.
You can use the overload of Load which takes a TextReader.
(I haven't looked over the rest of the code, but that addresses the "what do I do if I've already got the HTML in a string?" part.)
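A minimal sketch of the TextReader approach, assuming the page source is already in a string. The class name LoadFromStringExample, the method name Parse, and the shortened sample markup are hypothetical and only for illustration; Load(TextReader) and GetElementbyId are the HtmlAgilityPack calls being shown.

```csharp
using System;
using System.IO;
using HtmlAgilityPack;

// Hypothetical helper class, not part of HtmlAgilityPack.
public static class LoadFromStringExample
{
    // Parses HTML that is already held in a string by wrapping it
    // in a StringReader, which is a TextReader.
    public static HtmlDocument Parse(string html)
    {
        var document = new HtmlDocument();
        using (var reader = new StringReader(html))
        {
            document.Load(reader);
        }
        return document;
    }

    public static void Main()
    {
        // Shortened stand-in for the markup in the question.
        string html = "<div id='myWeb123'><img src=\"http://myWebSite.com/pics/HHTR_01.jpg\" /></div>";
        HtmlDocument document = Parse(html);

        // GetElementbyId returns null when no element has that id.
        HtmlNode div = document.GetElementbyId("myWeb123");
        Console.WriteLine(div != null ? "found " + div.Name : "not found");
    }
}
```

HtmlDocument also has a LoadHtml(string) method, which takes the string directly and avoids the reader.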
In this line:
HtmlNodeCollection linkNodes = document.DocumentNode.SelectNodes("//div[@id='myWeb123']");
you are selecting the <div> node, not the <img> nodes under it. Try this to select those img nodes:
HtmlNodeCollection linkNodes = document.DocumentNode.SelectNodes("//div[@id='myWeb123']/img");
As for the selection syntax, it's identical to XPath as used in XML. So search for XPath if you want examples of the selection.
In this case:
/ starts searching from the root of the document (instead of from some "current node")
// means that the next match can be at any depth instead of directly under the root
div[@id='myWeb123'] searches for a <div> node with an attribute 'id' that has value 'myWeb123'
/img searches for an img node directly under the matched div node.
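Putting the pieces together, here is a hedged sketch that collects the src values into a List<string>. The class name ImageSourceExtractor, the method name GetImageSources, and the shortened sample markup are hypothetical. It also assumes the id contains no single quote, since it is pasted straight into the XPath expression.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper class, not part of HtmlAgilityPack.
public static class ImageSourceExtractor
{
    // Returns the src of every <img> directly under the <div> with the given id.
    public static List<string> GetImageSources(string html, string divId)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var sources = new List<string>();

        // Assumption: divId contains no single quote.
        string xpath = "//div[@id='" + divId + "']/img[@src]";
        HtmlNodeCollection imageNodes = document.DocumentNode.SelectNodes(xpath);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        if (imageNodes == null)
        {
            return sources;
        }

        foreach (HtmlNode imageNode in imageNodes)
        {
            string src = imageNode.GetAttributeValue("src", string.Empty);
            if (src.Length > 0)
            {
                sources.Add(src);
            }
        }
        return sources;
    }

    public static void Main()
    {
        // Shortened stand-in for the markup in the question.
        string html =
            "<div style='padding-left:12px;' id='myWeb123'>" +
            "<img src=\"http://myWebSite.com/pics/HHTR_01.jpg\" alt='myWebSitePics' /><br /><br />" +
            "<img src=\"http://myWebSite.com/pics/HHTR_02.jpg\" alt='myWebSitePics' /><br /><br />" +
            "</div>";

        foreach (string src in GetImageSources(html, "myWeb123"))
        {
            Console.WriteLine(src);
        }
    }
}
```

The null check matters because a foreach over the result of SelectNodes throws when the XPath matches nothing.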
Using XPath like this will be expensive if the page size grows.
It is best to deserialize the HTML to an object. You also don't need the HtmlAgilityPack reference that you are using. Load the HTML using a StreamReader and then use XmlSerializer.
Use the XSD tool, first to convert to an XSD and then to generate a class from that XSD.
Import this class into your solution.
The col object will then contain all the src values of the img tags in one shot.