Screen-scraping a website for PDF links to download

Published 2024-10-21 17:01:21

I'm learning C# through creating a small program, and couldn't find a similar post (apologies if this has already been answered somewhere else).

How might I go about screen-scraping a website for links to PDFs (which I can then download to a specified location)? Sometimes a page will have a link to another HTML page which has the actual PDF link, so if the actual PDF can't be found on the first page I'd like it to automatically look for a link that has "PDF" in its text, and then search that resulting HTML page for the real PDF link.

I know that I could probably achieve something similar via filetype searching through Google, but that seems like "cheating" to me :) I'd rather learn how to do it in code, but I'm not sure where to start. I'm a little familiar with XML parsing with XElement and such, but I'm not sure how to go about getting links from an HTML page (or other formats?).

Could anyone point me in the right direction? Thanks!

Comments (3)

伤痕我心 2024-10-28 17:01:21

HtmlAgilityPack is great for this kind of stuff.

Example implementation:

// Note: the LINQ query below needs 'using System.Linq;' at the top of the file.
string pdfLinksUrl = "http://www.google.com/search?q=filetype%3Apdf";

// Load the HTML content
var webGet = new HtmlAgilityPack.HtmlWeb();
var doc = webGet.Load(pdfLinksUrl);

// Select all <a> nodes carrying an href attribute using XPath
// (unfortunately we can't select attribute nodes directly, as
// that is not yet supported by HAP).
// SelectNodes returns null when nothing matches, so guard against that.
var linkNodes = doc.DocumentNode.SelectNodes("//a[@href]")
                ?? Enumerable.Empty<HtmlAgilityPack.HtmlNode>();

// Select all href attribute values ending with '.pdf' (case-insensitive)
var pdfUrls = from linkNode in linkNodes
              let href = linkNode.Attributes["href"].Value
              where href.ToLower().EndsWith(".pdf")
              select href;

// Write all PDF links to a file
System.IO.File.WriteAllLines(@"c:\pdflinks.txt", pdfUrls.ToArray());
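
The snippet above only records the links. Since the question also asks about downloading each PDF to a specified location, here is a rough follow-on sketch (my addition, not part of the original answer): it assumes the pdfUrls sequence from above already holds absolute URLs, and the PdfDownloader/DownloadAll names and the target folder are placeholders.

// A hypothetical helper for downloading the collected links; the class and
// method names are illustrative only.
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;

static class PdfDownloader
{
    public static void DownloadAll(IEnumerable<string> pdfUrls, string targetFolder)
    {
        Directory.CreateDirectory(targetFolder);

        using (var client = new WebClient())
        {
            foreach (string url in pdfUrls)
            {
                // Very naive file-name choice; real code should sanitize the
                // name, handle duplicates, and resolve relative URLs first.
                string fileName = Path.GetFileName(new Uri(url).LocalPath);
                client.DownloadFile(url, Path.Combine(targetFolder, fileName));
            }
        }
    }
}

For example, PdfDownloader.DownloadAll(pdfUrls, @"c:\pdfs") would save each file under c:\pdfs.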

As a side note, I would not rely too much on XPath expressions in HAP. Some XPath functions are missing, and putting all of the extraction logic inside your XPath will make your code less maintainable. I would extract a bare minimum using an XPath expression, and then do all the required extraction by iterating through the node collection (LINQ methods help a lot).

The real power of HAP is its ability to parse SGML-like documents, that is, markup which may be invalid from an XHTML point of view (unclosed tags, missing quotes, etc.).

简单气质女生网名 2024-10-28 17:01:21

Your best bet is probably to use the HTML Agility Pack to screen-scrape the page, then check each href attribute to see whether it looks like a PDF download. If not, you could look at the text within the node for keywords such as "PDF" to decide whether or not to follow the link.
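
A rough sketch of that two-step decision (my own illustration with HtmlAgilityPack, not code from the answer; the FindPdfLinks name, the depth limit, and the assumption that hrefs are absolute URLs are all mine):

// Collect direct .pdf hrefs; if a link merely mentions "PDF" in its text,
// load that page as well and scan it one level deeper.
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class PdfLinkFinder
{
    public static List<string> FindPdfLinks(string url, int depth = 1)
    {
        var results = new List<string>();
        var doc = new HtmlWeb().Load(url);
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
            return results;

        foreach (var a in anchors)
        {
            string href = a.GetAttributeValue("href", "");
            if (href.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase))
            {
                results.Add(href);
            }
            else if (depth > 0 &&
                     a.InnerText.IndexOf("PDF", StringComparison.OrdinalIgnoreCase) >= 0)
            {
                // The link only mentions PDF: follow it and scan that page too.
                // (Relative hrefs would need to be resolved against the page
                // URL before being passed to HtmlWeb.Load.)
                results.AddRange(FindPdfLinks(href, depth - 1));
            }
        }
        return results;
    }
}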

林空鹿饮溪 2024-10-28 17:01:21

For parsing any HTML page, use HtmlAgilityPack. It's the best around.

With it, you can transform any HTML page into XML, which is much easier to search than raw HTML.

If you need to crawl a site for information, have a look at NCrawler.
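
Since the question mentions being familiar with XElement, one way to act on this suggestion is to have HtmlAgilityPack emit its cleaned-up markup as XML and then query it with LINQ to XML. A rough sketch of that route (my illustration; the URL is a placeholder, and real-world pages can still trip up the conversion, e.g. duplicate attributes or odd entities):

// Let HAP repair the HTML, save it as XML, then query with XDocument/XElement.
using System;
using System.IO;
using System.Linq;
using System.Xml.Linq;
using HtmlAgilityPack;

class HtmlToXmlExample
{
    static void Main()
    {
        var doc = new HtmlWeb().Load("http://example.com/");   // placeholder URL
        doc.OptionOutputAsXml = true;                          // emit well-formed XML on save

        var writer = new StringWriter();
        doc.Save(writer);
        var xml = XDocument.Parse(writer.ToString());

        // XElement-style queries now work, e.g. all hrefs ending in ".pdf"
        // (element-name case follows the source markup, usually lowercase "a"):
        var pdfLinks = xml.Descendants("a")
                          .Select(a => (string)a.Attribute("href"))
                          .Where(h => h != null &&
                                      h.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase));

        foreach (var link in pdfLinks)
            Console.WriteLine(link);
    }
}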
