如何通过 Html Agility Pack 从网页中提取链接、文本和时间戳
我正在使用 Html Agility Pack,并尝试从以下 html 代码中提取链接和链接文本。该网页是从远程页面获取并整体保存在本地的。然后,我尝试从这个本地网页提取链接和链接文本。该网页在其页面内自然有其他 html 代码,如其他链接文本等,但为了清楚起见,此处将其删除。
<span class="Subject2"><a href="/some/today.nsf/0/EC8A39D274864X5BC125798B0029E305?open">
Description 1 text here</span> <span class="time">2012-01-20 08:35</span></a><br>
<span class="Subject2"><a href="/some/today.nsf/0/EC8A39XXXX264X5BC125798B0029E312?open">
Description 2 text here</span> <span class="time">2012-01-20 09:35</span></a><br>
但在尝试提取链接和链接文本时,以上内容是最独特的内容。
这就是我希望看到的结果
<link>/some/today.nsf/0/EC8A39D274864X5BC125798B0029E305</link>
<title>Description 1 text here</title>
<pubDate>Wed, 20 Jan 2012 07:35:00 +0100</pubDate>
<link>/some/today.nsf/0/ EC8A39XXXX264X5BC125798B0029E312</link>
<title>Description 2 text here</title>
<pubDate> Wed, 20 Jan 2012 08:35:00 +0100</pubDate>
这是我到目前为止的代码:
var linksOnPage = from lnks in document.DocumentNode.SelectNodes("//span[starts-with(@class, 'Subject2')]")
(lnks.Name == "a" &&
lnks.Attributes["href"] != null &&
lnks.InnerText.Trim().Length > 0)
select new
{
Url = lnks.Attributes["href"].Value,
Text = lnks.InnerText
Time = lnks. Attributes["time"].Value
};
foreach (var link in linksOnPage)
{
// Loop through.
Response.Write("<link>" + link.Url + "</link>");
Response.Write("<title>" + link.Text + "</title>");
Response.Write("<pubDate>" + link.Time + "</pubDate>");
}
它不起作用,我什么也没得到。
因此,任何建议和帮助将不胜感激。
提前致谢。
更新:我现在已经设法使语法正确,以便从上面的示例中选择链接:使用以下代码:
var linksOnPage = from lnks in document.DocumentNode.SelectNodes("//span[@class='Subject2']//a")
这可以很好地选择带有网址和文本的链接,但是我该怎么办还要获取时间戳吗?
也就是说,从中选择
<span class="time">2012-01-20 09:35</span></a>
每个链接后面的时间戳。根据上面的内容,输出循环内的每个链接都有该输出吗?感谢您对此的任何帮助。
I am using Html Agility Pack and are trying to extract the links and link text from the following html code. The webpage is fetched from a remote page and the saved locally as a whole. Then from this local webpage I am trying to extract the links and link text. The webpage naturally has other html code like other links text, etc inside its page but is removed here for clarity.
<span class="Subject2"><a href="/some/today.nsf/0/EC8A39D274864X5BC125798B0029E305?open">
Description 1 text here</span> <span class="time">2012-01-20 08:35</span></a><br>
<span class="Subject2"><a href="/some/today.nsf/0/EC8A39XXXX264X5BC125798B0029E312?open">
Description 2 text here</span> <span class="time">2012-01-20 09:35</span></a><br>
But the above are the most unique content to work from when trying to extract the links and linktext.
This is what I would like to see as the result
<link>/some/today.nsf/0/EC8A39D274864X5BC125798B0029E305</link>
<title>Description 1 text here</title>
<pubDate>Wed, 20 Jan 2012 07:35:00 +0100</pubDate>
<link>/some/today.nsf/0/ EC8A39XXXX264X5BC125798B0029E312</link>
<title>Description 2 text here</title>
<pubDate> Wed, 20 Jan 2012 08:35:00 +0100</pubDate>
This is my code so far:
var linksOnPage = from lnks in document.DocumentNode.SelectNodes("//span[starts-with(@class, 'Subject2')]")
(lnks.Name == "a" &&
lnks.Attributes["href"] != null &&
lnks.InnerText.Trim().Length > 0)
select new
{
Url = lnks.Attributes["href"].Value,
Text = lnks.InnerText
Time = lnks. Attributes["time"].Value
};
foreach (var link in linksOnPage)
{
// Loop through.
Response.Write("<link>" + link.Url + "</link>");
Response.Write("<title>" + link.Text + "</title>");
Response.Write("<pubDate>" + link.Time + "</pubDate>");
}
And its not working, I am getting nothing.
So any suggestions and help would be highly appreciated.
Thanks in advance.
Update: I have managed to get the syntax correct now, in order to select the links from the above examples: With the following code:
var linksOnPage = from lnks in document.DocumentNode.SelectNodes("//span[@class='Subject2']//a")
This selects the links nicely with url and text, but how do I go about also getting the time stamp?
That is, select the timestamp out of this:
<span class="time">2012-01-20 09:35</span></a>
which follows each link. And have that output with each link inside the output loop according to the above? Thanks for any help in regards to this.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
您的 HTML 示例格式错误,这就是您得到意外结果的原因。
要找到第一个和第二个值,您必须在
中获取
- 第一个值是 a
href
属性值,第二个是锚点的InnerText
。要获取第三个值,您必须获取标记的以下同级标记并获取其
InnerText
。看,你可以这样做:
输出如下:
Your HTML example is malformed, that's why you get unexpected results.
To find your first and second values you'll have to get the
<a>
inside your<span class='Subject2'>
- the first value is ahref
attribute value, the second isInnerText
of the anchor. To get the third value you'll have to get the following sibling of the<span class='Subject2'>
tag and get itsInnerText
.See, this how you can do it:
this outputs something like: