Screen-scraping a website for PDF links to download

Published 2024-10-21 17:01:21

I'm learning C# through creating a small program, and couldn't find a similar post (apologies if this has already been answered somewhere else).

How might I go about screen-scraping a website for links to PDFs (which I can then download to a specified location)? Sometimes a page will have a link to another HTML page which has the actual PDF link, so if the actual PDF can't be found on the first page I'd like it to automatically look for a link that has "PDF" in its text, and then search that resulting HTML page for the real PDF link.

I know that I could probably achieve something similar via filetype searching through Google, but that seems like "cheating" to me :) I'd rather learn how to do it in code, but I'm not sure where to start. I'm a little familiar with XML parsing with XElement and such, but I'm not sure how to go about getting links from an HTML page (or other formats?).

Could anyone point me in the right direction? Thanks!

Comments (3)

伤痕我心 2024-10-28 17:01:21

HtmlAgilityPack is great for this kind of stuff.

Example implementation:

// Note: the LINQ query below needs 'using System.Linq;' at the top of the file.
string pdfLinksUrl = "http://www.google.com/search?q=filetype%3Apdf";

// Load the HTML content
var webGet = new HtmlAgilityPack.HtmlWeb();
var doc = webGet.Load(pdfLinksUrl);

// Select all <a> nodes carrying an href attribute using XPath
// (unfortunately we can't select attribute nodes directly, as
// that is not yet supported by HAP).
// SelectNodes returns null when nothing matches, so guard against that.
var linkNodes = doc.DocumentNode.SelectNodes("//a[@href]")
                ?? Enumerable.Empty<HtmlAgilityPack.HtmlNode>();

// Select all href attribute values ending with '.pdf' (case-insensitive)
var pdfUrls = from linkNode in linkNodes
              let href = linkNode.Attributes["href"].Value
              where href.ToLower().EndsWith(".pdf")
              select href;

// Write all PDF links to a file
System.IO.File.WriteAllLines(@"c:\pdflinks.txt", pdfUrls.ToArray());
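
The snippet above only records the links. Since the question also asks about downloading each PDF to a specified location, here is a rough follow-on sketch (my addition, not part of the original answer): it assumes the pdfUrls sequence from above already holds absolute URLs, and the PdfDownloader/DownloadAll names and the target folder are placeholders.

// A hypothetical helper for downloading the collected links; the class and
// method names are illustrative only.
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;

static class PdfDownloader
{
    public static void DownloadAll(IEnumerable<string> pdfUrls, string targetFolder)
    {
        Directory.CreateDirectory(targetFolder);

        using (var client = new WebClient())
        {
            foreach (string url in pdfUrls)
            {
                // Very naive file-name choice; real code should sanitize the
                // name, handle duplicates, and resolve relative URLs first.
                string fileName = Path.GetFileName(new Uri(url).LocalPath);
                client.DownloadFile(url, Path.Combine(targetFolder, fileName));
            }
        }
    }
}

For example, PdfDownloader.DownloadAll(pdfUrls, @"c:\pdfs") would save each file under c:\pdfs.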

As a side note, I would not rely too much on XPath expressions in HAP. Some XPath functions are missing, and putting all of the extraction logic inside your XPath will make your code less maintainable. I would extract a bare minimum using an XPath expression, and then do all the required extraction by iterating through the node collection (LINQ methods help a lot).

The real power of HAP is its ability to parse SGML-like documents, that is, markup which may be invalid from an XHTML point of view (unclosed tags, missing quotes, etc.).

简单气质女生网名 2024-10-28 17:01:21

Your best bet is probably to use the HTML Agility Pack to screen-scrape the page, then check each href attribute to see whether it looks like a PDF download. If not, you could look at the text within the node for keywords such as "PDF" to decide whether or not to follow the link.
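
A rough sketch of that two-step decision (my own illustration with HtmlAgilityPack, not code from the answer; the FindPdfLinks name, the depth limit, and the assumption that hrefs are absolute URLs are all mine):

// Collect direct .pdf hrefs; if a link merely mentions "PDF" in its text,
// load that page as well and scan it one level deeper.
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class PdfLinkFinder
{
    public static List<string> FindPdfLinks(string url, int depth = 1)
    {
        var results = new List<string>();
        var doc = new HtmlWeb().Load(url);
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
            return results;

        foreach (var a in anchors)
        {
            string href = a.GetAttributeValue("href", "");
            if (href.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase))
            {
                results.Add(href);
            }
            else if (depth > 0 &&
                     a.InnerText.IndexOf("PDF", StringComparison.OrdinalIgnoreCase) >= 0)
            {
                // The link only mentions PDF: follow it and scan that page too.
                // (Relative hrefs would need to be resolved against the page
                // URL before being passed to HtmlWeb.Load.)
                results.AddRange(FindPdfLinks(href, depth - 1));
            }
        }
        return results;
    }
}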

林空鹿饮溪 2024-10-28 17:01:21

For parsing any HTML page, use HtmlAgilityPack. It's the best around.

With it, you can transform any HTML page into XML, which is much easier to search than raw HTML.

If you need to crawl a site for information, have a look at NCrawler.
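
Since the question mentions being familiar with XElement, one way to act on this suggestion is to have HtmlAgilityPack emit its cleaned-up markup as XML and then query it with LINQ to XML. A rough sketch of that route (my illustration; the URL is a placeholder, and real-world pages can still trip up the conversion, e.g. duplicate attributes or odd entities):

// Let HAP repair the HTML, save it as XML, then query with XDocument/XElement.
using System;
using System.IO;
using System.Linq;
using System.Xml.Linq;
using HtmlAgilityPack;

class HtmlToXmlExample
{
    static void Main()
    {
        var doc = new HtmlWeb().Load("http://example.com/");   // placeholder URL
        doc.OptionOutputAsXml = true;                          // emit well-formed XML on save

        var writer = new StringWriter();
        doc.Save(writer);
        var xml = XDocument.Parse(writer.ToString());

        // XElement-style queries now work, e.g. all hrefs ending in ".pdf"
        // (element-name case follows the source markup, usually lowercase "a"):
        var pdfLinks = xml.Descendants("a")
                          .Select(a => (string)a.Attribute("href"))
                          .Where(h => h != null &&
                                      h.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase));

        foreach (var link in pdfLinks)
            Console.WriteLine(link);
    }
}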
