Is there a way to get files from a web server when directory listing is disabled?
I am trying to build a "crawler", or an "automatic downloader", for every file hosted on a webserver / webpage.
In my opinion there are two cases:
1) Directory listing is enabled. Then it is easy: read the entries from the listing and download every file you see.
2) Directory listing is disabled. What then? The only idea I have is to brute-force filenames and watch the server's reaction (e.g. 404 means no such file, 403 means a directory was found, and actual data means a file was correctly guessed).
Is my idea right? Is there a better way?
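Roughly what I have in mind for case 2, as a minimal Python sketch using the requests package (the base URL and the candidate names below are just placeholders):

```python
# Rough sketch of case 2: probe candidate names and interpret the
# server's status codes. Base URL and candidate list are placeholders.
import requests

BASE_URL = "http://example.com/files/"                           # placeholder target
CANDIDATES = ["index.html", "backup.zip", "data.csv", "images"]  # made-up guesses

for name in CANDIDATES:
    url = BASE_URL + name
    resp = requests.get(url, timeout=10, allow_redirects=False)
    if resp.status_code == 404:
        print(f"{url}: not found")
    elif resp.status_code == 403:
        print(f"{url}: exists, but access is forbidden (maybe a directory)")
    elif resp.status_code == 200:
        print(f"{url}: found, {len(resp.content)} bytes")
        with open(name, "wb") as fh:          # save the response body locally
            fh.write(resp.content)
    else:
        print(f"{url}: status {resp.status_code}")
```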
2 Answers
You can always parse the HTML and follow ('crawl') the links you find. This is the way most crawlers are implemented.
Check out these libraries that could help you do it:
.NET: Html Agility Pack
Python: Beautiful Soup
PHP: HTMLSimpleDom
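For example, here is a minimal link-following sketch with the Python option (requests + Beautiful Soup); the start URL is a placeholder, and a real crawler would also need rate limiting, better error handling, and the robots.txt check described below:

```python
# Minimal breadth-first crawler sketch using requests + Beautiful Soup.
# The start URL is a placeholder; politeness (rate limiting) and
# robots.txt checks are deliberately left out of this sketch.
from urllib.parse import urljoin, urlparse
import requests
from bs4 import BeautifulSoup

START_URL = "http://example.com/"        # placeholder
visited = set()
queue = [START_URL]

while queue:
    url = queue.pop(0)
    if url in visited:
        continue
    visited.add(url)
    try:
        resp = requests.get(url, timeout=10)
    except requests.RequestException:
        continue
    if "text/html" not in resp.headers.get("Content-Type", ""):
        print("file:", url)              # non-HTML response: a downloadable file
        continue
    soup = BeautifulSoup(resp.text, "html.parser")
    for a in soup.find_all("a", href=True):
        link = urljoin(url, a["href"])   # resolve relative links
        # stay on the same host and drop fragments for this sketch
        if urlparse(link).netloc == urlparse(START_URL).netloc:
            queue.append(link.split("#")[0])

print(f"crawled {len(visited)} URLs")
```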
ALWAYS look for robots.txt in the site's root and make sure you respect the site's rules about which pages are allowed to be crawled.
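A hedged way to do that check in Python is the standard library's urllib.robotparser (the site URL and user-agent string below are placeholders):

```python
# Check robots.txt before fetching a URL, using only the standard library.
# The site URL and user-agent string are placeholders.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")   # placeholder site
rp.read()                                     # fetch and parse robots.txt

user_agent = "MyCrawler"                      # placeholder user agent
for path in ("/", "/private/report.pdf"):
    allowed = rp.can_fetch(user_agent, "http://example.com" + path)
    print(path, "->", "allowed" if allowed else "disallowed")
```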
You shouldn't index pages that the webmaster prevents you from indexing; that is what robots.txt is all about.
You should also check for a Sitemap file, which is usually sitemap.xml, or sometimes its name is mentioned in robots.txt.
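A small sketch of reading such a sitemap with Python's standard library (the sitemap URL is a placeholder; sitemap index files and gzipped sitemaps are not handled here):

```python
# Fetch a sitemap.xml and list the URLs it declares, standard library only.
# The sitemap location is a placeholder; nested sitemap index files and
# gzipped sitemaps are not handled in this sketch.
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_URL = "http://example.com/sitemap.xml"   # placeholder
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

with urllib.request.urlopen(SITEMAP_URL, timeout=10) as resp:
    tree = ET.parse(resp)

for loc in tree.getroot().findall("sm:url/sm:loc", NS):
    print(loc.text.strip())
```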