Get all PDF files from a domain (e.g. *.adomain.com)
I need to download all PDF files from a certain domain. There are about 6,000 PDFs on that domain, and most of them don't have an HTML link (either the link has been removed or one was never put there in the first place).
I know there are about 6,000 files because I'm googling: filetype:pdf site:*.adomain.com
However, Google lists only the first 1,000 results. I believe there are two ways to achieve this:
a) Use Google. However, how can I get all 6,000 results from Google? Maybe a scraper? (tried Scroogle, no luck)
b) Skip Google and search the domain directly for PDF files. How do I do that when most of them are not linked?
If the links to the files have been removed, and you have no permission to list the directories, it's basically impossible to know which URLs have a PDF file behind them.
You could have a look at http://www.archive.org and look up a previous state of the pages if you believe there have been links to the files in the past.
To retrieve all PDFs that are linked on the site recursively, I recommend wget. Start from the examples at http://www.gnu.org/software/wget/manual/html_node/Advanced-Usage.html#Advanced-Usage
(Simply replace .gif with .pdf!)
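Following that manual example with the substitution applied (a sketch only; the start URL below is a stand-in for the real domain and directory), the command looks like:

```shell
# Recursively download PDFs, per the wget manual's "Advanced Usage" example.
# -r          recurse through links found in the pages
# -l1         limit recursion to one level (raise it, e.g. -l inf, to crawl the whole site)
# --no-parent never ascend above the starting directory
# -A.pdf      keep only files whose names end in .pdf
wget -r -l1 --no-parent -A.pdf http://www.adomain.com/dir/
```

Note that wget can only follow links it actually finds in the HTML, so this will not discover the unlinked PDFs; for those, the archive.org approach above is the only real lead.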