How do I use Html Agility Pack to get img/src or a/hrefs?

Posted on 2024-10-15 01:13:13 · 154 words · 7 views · 0 comments

I want to use the HTML Agility Pack to parse image and href links from an HTML page, but I don't know much about XML or XPath. Although I have looked through help documents on many websites, I can't solve the problem. In addition, I use C# in Visual Studio 2005. I can't speak English fluently, so I will give my sincere thanks to anyone who can write some helpful code.

Comments (6)

热风软妹 2024-10-22 01:13:13

The first example on the home page does something very similar, but consider:

HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm"); // would need doc.LoadHtml(htmlSource) if it is not a file
doc.OptionEmptyCollection = true; // avoid null reference exception
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
   string href = link.Attributes["href"].Value;
   // store href somewhere
}

So, as you can imagine, for img/@src you just replace each a with img and each href with src.
You might even be able to simplify it to:

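List<string> list = new List<string>(); // collected href and src values
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a/@href | //img/@src"))
{
    HtmlAttribute href = node.Attributes["href"];
    HtmlAttribute src = node.Attributes["src"];
    // take href when the node has one, otherwise fall back to src
    list.Add((href ?? src).Value);
}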

For relative URL handling, look at the Uri class.
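
As a rough illustration of the Uri suggestion, the sketch below resolves an extracted href or src value against the address of the page it came from. The names UrlResolver and Resolve and the example.com addresses are made up for this example; only System.Uri and its TryCreate overloads are real API here.

```csharp
using System;

public static class UrlResolver
{
    // Hypothetical helper: turns a possibly relative href/src into an absolute URL.
    // Returns null when pageUrl is not absolute or the two values cannot be combined.
    public static string Resolve(string pageUrl, string link)
    {
        Uri baseUri;
        if (!Uri.TryCreate(pageUrl, UriKind.Absolute, out baseUri))
            return null;

        Uri result;
        return Uri.TryCreate(baseUri, link, out result) ? result.AbsoluteUri : null;
    }
}
```

For instance, UrlResolver.Resolve("http://example.com/a/b.html", "../img/x.png") gives "http://example.com/img/x.png", while a link that is already absolute comes back unchanged.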

源来凯始玺欢你 2024-10-22 01:13:13

The example and the accepted answer are wrong: they don't compile with the latest version. I tried something else:

    // Requires: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
    private List<string> ParseLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        // SelectNodes returns null, not an empty collection, when nothing matches
        var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
        return nodes == null
            ? new List<string>()
            : nodes.Select(a => a.GetAttributeValue("href", string.Empty)).ToList();
    }

This works for me.

半城柳色半声笛 2024-10-22 01:13:13

Maybe I am too late to post an answer, but the following worked for me:

// MainImageNode is assumed to be an HtmlNode for an <img> element selected earlier
var MainImageString = MainImageNode.Attributes.Where(i => i.Name == "src").FirstOrDefault()?.Value;

满天都是小星星 2024-10-22 01:13:13

var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);

string name = htmlDoc.DocumentNode
    .SelectNodes("//td/input")
    .First()
    .Attributes["value"].Value;

Source:
https://html-agility-pack.net/select-nodes

凉薄对峙 2024-10-22 01:13:13

You also need to take into account the document base URL element (<base>) and protocol-relative URLs (for example //www.foo.com/bar/).
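
A minimal sketch of how both cases might be handled, assuming the HtmlAgilityPack NuGet package is referenced and that you know the address the page was downloaded from. LinkCollector, Collect and the pageUrl parameter are hypothetical names for this example, not part of the library.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkCollector
{
    // Hypothetical helper: returns absolute URLs for every a/@href and img/@src in the markup.
    // pageUrl is the address the HTML was fetched from (an assumption of this sketch).
    public static List<string> Collect(string html, string pageUrl)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Start from the page address, then let <base href="..."> override it if present.
        var baseUri = new Uri(pageUrl, UriKind.Absolute);
        HtmlNode baseNode = doc.DocumentNode.SelectSingleNode("//base[@href]");
        if (baseNode != null)
        {
            Uri fromBase;
            if (Uri.TryCreate(baseUri, baseNode.GetAttributeValue("href", ""), out fromBase))
                baseUri = fromBase;
        }

        var urls = new List<string>();
        // SelectNodes returns null by default when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@href] | //img[@src]");
        if (nodes == null)
            return urls;

        foreach (HtmlNode node in nodes)
        {
            string attr = node.Name == "a" ? "href" : "src";
            Uri absolute;
            // Protocol-relative values such as //www.foo.com/bar/ take the scheme of baseUri.
            if (Uri.TryCreate(baseUri, node.GetAttributeValue(attr, ""), out absolute))
                urls.Add(absolute.AbsoluteUri);
        }
        return urls;
    }
}
```

The base element is resolved against the page address first because its own href may be relative; every link is then resolved against that result.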

For more information check:

红ご颜醉 2024-10-22 01:13:13

Late post, but here's a 2021 update to the accepted answer (it accounts for the refactoring that HtmlAgilityPack has made).

    var htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(html);

    // The XPath below gets images.
    // It is specific to a site. Yours will vary ...
    string command = "//a[contains(concat(' ', @class, ' '), 'product-card')]//img";
    List<string> listImages = new();
    foreach (HtmlNode node in htmlDoc.DocumentNode.SelectNodes(command))
    {
        // Using "data-src" below, but it may be "src" for you
        listImages.Add(node.Attributes["data-src"].Value);
    }