How can I use Html Agility Pack to get img/src or a/href values?

Posted on 2024-10-15 01:13:13

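I want to use the HTML Agility Pack to extract the images and href links from an HTML page, but I don't know much about XML or XPath. I have looked through the help documentation on many websites and still could not work it out. I am using C# in Visual Studio 2005, and my English is not very fluent, so my sincere thanks go to anyone who can write some useful code.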

热风软妹 2024-10-22 01:13:13

The first example on the home page does something very similar, but consider:

HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm"); // use doc.LoadHtml(htmlSource) if the HTML is in a string rather than a file
doc.OptionEmptyCollection = true; // SelectNodes then returns an empty collection instead of null when nothing matches
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    string href = link.Attributes["href"].Value;
    // store href somewhere
}

For img/@src, just replace each a with img and each href with src.
You might even be able to simplify to:

List<string> list = new List<string>(); // collects both href and src values
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a/@href | //img/@src"))
{
    // each match is the element that owns the attribute, so read whichever one it carries
    HtmlAttribute href = node.Attributes["href"];
    HtmlAttribute src = node.Attributes["src"];
    list.Add((href ?? src).Value);
}

For relative URL handling, look at the Uri class.
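
To make the Uri suggestion concrete, here is a minimal sketch of resolving a collected href or src value against the address of the page it came from. It uses only System.Uri and not Html Agility Pack; the class name LinkResolver, the method name Resolve and the example.com URLs are hypothetical names chosen for illustration.

```csharp
using System;

public static class LinkResolver
{
    // Resolves a raw href/src value against the URL of the page it was found on.
    // pageUrl must be an absolute URL; returns null if the value cannot be parsed.
    public static string Resolve(string pageUrl, string rawValue)
    {
        var baseUri = new Uri(pageUrl, UriKind.Absolute);

        // TryCreate combines base and relative parts; an already absolute
        // rawValue is returned unchanged.
        Uri resolved;
        return Uri.TryCreate(baseUri, rawValue, out resolved)
            ? resolved.AbsoluteUri
            : null;
    }
}
```

For example, LinkResolver.Resolve("http://example.com/docs/page.html", "images/logo.png") gives "http://example.com/docs/images/logo.png". Inside either loop above you would pass the attribute value through this before storing it.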

源来凯始玺欢你 2024-10-22 01:13:13

The example and the accepted answer are wrong. They don't compile with the latest version. I tried something else:

    // Requires: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
    private List<string> ParseLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        // SelectNodes returns null when no anchor matches
        var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
        if (nodes == null) return new List<string>();
        // Keep only the href value of each anchor
        return nodes.Select(n => n.GetAttributeValue("href", string.Empty)).ToList();
    }

This works for me.

半城柳色半声笛 2024-10-22 01:13:13

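Maybe I am too late to post an answer here, but the following worked for me:

// FirstOrDefault yields an HtmlAttribute (or null); .Value is the src string (requires using System.Linq)
var MainImageString = MainImageNode.Attributes.FirstOrDefault(i => i.Name == "src")?.Value;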
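
The line above assumes MainImageNode has already been found. As a rough, self-contained sketch of the whole lookup, the following loads the markup, picks the first img that has a src, and reads the attribute with GetAttributeValue. The class name ImageSrcExample and the method name FirstImageSrc are hypothetical, and the XPath //img[@src] is only an assumption about which image you want.

```csharp
using HtmlAgilityPack;

public static class ImageSrcExample
{
    // Returns the src of the first <img> that has one, or null if there is none.
    public static string FirstImageSrc(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when nothing matches the XPath.
        HtmlNode img = doc.DocumentNode.SelectSingleNode("//img[@src]");
        if (img == null)
        {
            return null;
        }

        // The second argument is the fallback used if the attribute is missing.
        return img.GetAttributeValue("src", string.Empty);
    }
}
```

Checking for null before touching the node avoids the NullReferenceException you would otherwise get on pages without a matching image.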

满天都是小星星 2024-10-22 01:13:13

var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);

string name = htmlDoc.DocumentNode
    .SelectNodes("//td/input")
    .First()
    .Attributes["value"].Value;

Source:
https://html-agility-pack.net/select-nodes

凉薄对峙 2024-10-22 01:13:13

You also need to take into account the document base URL element (<base>) and protocol-relative URLs (for example //www.foo.com/bar/).
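
A minimal sketch of both points, using only System.Uri: the base href, if the page declares one, replaces the page URL as the starting point, and a protocol-relative value picks up the scheme of that base. The names BaseAwareResolver, Resolve, pageUrl and baseHref are hypothetical. The sketch assumes you have already read the base href yourself (for instance with SelectSingleNode("//base[@href]")) and pass null when the page has none.

```csharp
using System;

public static class BaseAwareResolver
{
    // pageUrl:  absolute URL the document was fetched from.
    // baseHref: value of <base href="..."> if the page has one, otherwise null.
    // rawValue: the href/src value found in the markup.
    public static string Resolve(string pageUrl, string baseHref, string rawValue)
    {
        Uri effectiveBase = new Uri(pageUrl, UriKind.Absolute);

        // A base href may itself be relative, so resolve it against the page URL first.
        Uri fromBaseElement;
        if (!string.IsNullOrEmpty(baseHref) &&
            Uri.TryCreate(effectiveBase, baseHref, out fromBaseElement))
        {
            effectiveBase = fromBaseElement;
        }

        // Protocol-relative values such as //host/path inherit the base scheme here.
        Uri resolved;
        return Uri.TryCreate(effectiveBase, rawValue, out resolved)
            ? resolved.AbsoluteUri
            : null;
    }
}
```

For example, Resolve("https://example.com/a/b.html", null, "//www.foo.com/bar/") gives "https://www.foo.com/bar/".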

For more information, check:

红ご颜醉 2024-10-22 01:13:13

Late post, but here's a 2021 update to the accepted answer (it accounts for the refactoring that HtmlAgilityPack has made since).

    var htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(html);

    // The XPath below gets images.
    // It is specific to one site. Yours will vary ...
    string command = "//a[contains(concat(' ', @class, ' '), 'product-card')]//img";
    List<string> listImages = new();
    // Query the document loaded above (htmlDoc)
    foreach (HtmlNode node in htmlDoc.DocumentNode.SelectNodes(command))
    {
        // Using "data-src" below, but it may be "src" for you
        listImages.Add(node.Attributes["data-src"].Value);
    }
