Calling multiple URLs in Python and extracting data from them
I'm trying to write a script that will call up a webpage (http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Tree&id=7742&lvl=3&lin=f&keep=1&srchmode=1&unlock), scan it, and then pull out the orders, families, genera, and species within each nested taxonomic group. However, I only want the vertebrates (a very small section of the entire website), and the URLs associated with the various vertebrate taxa don't follow any discernible pattern (i.e. they aren't sequential). Is there a way to do this reasonably? I've been having a lot of problems trying to formulate different ways to accomplish this goal.
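Although the taxon URLs aren't sequential, they do share one structural feature: each links back through the same CGI script with the taxid in the `id` query parameter (`wwwtax.cgi?...&id=<taxid>`). So one approach is to fetch the root vertebrate page and harvest the `id` values from its links rather than trying to guess URLs. A minimal sketch using only the standard library; the sample HTML here is a made-up stand-in for a fetched page:

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs

class TaxonLinkParser(HTMLParser):
    """Collect NCBI taxonomy ids from <a href="...wwwtax.cgi?...id=N..."> links."""
    def __init__(self):
        super().__init__()
        self.taxon_ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href", "")
        if "wwwtax.cgi" not in href:
            return
        # the taxid rides in the "id" parameter of the query string
        qs = parse_qs(urlparse(href).query)
        if "id" in qs:
            self.taxon_ids.append(int(qs["id"][0]))

# hypothetical snippet standing in for the real page's HTML
sample = '''
<a href="/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=8292">Amphibia</a>
<a href="/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=8782">Aves</a>
<a href="/news.html">unrelated link</a>
'''
parser = TaxonLinkParser()
parser.feed(sample)
print(parser.taxon_ids)  # -> [8292, 8782]
```

In a real run you would download each page (e.g. with `urllib.request.urlopen`) and feed the response body to the parser; the filtering on `wwwtax.cgi` is what keeps the crawl inside the taxonomy browser and away from the rest of the site.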
Answers (1)
It isn't very clear (at least to me) exactly what you want to do, but from what I understand, you need to create some sort of crawler that looks for links in the pages, navigates the site, and categorizes them?
If so, go for mechanize, which allows you to emulate a web browser and navigate around the site, easily extracting and following links, submitting forms, etc.
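Whether you use mechanize or plain urllib for the fetching, the crawl itself is just a graph walk: visit a taxon page, extract its child-taxon links, and queue any ids you haven't seen yet. A sketch of that logic with the network stubbed out by a dictionary of fake pages (the taxids and page contents below are invented for illustration):

```python
import re
from collections import deque

# stand-in pages keyed by taxid; a real crawler would instead fetch each
# wwwtax.cgi?...&id=<taxid> URL (e.g. with mechanize's Browser.open)
FAKE_PAGES = {
    7742: '<a href="wwwtax.cgi?id=7777">Chondrichthyes</a>'
          '<a href="wwwtax.cgi?id=8292">Amphibia</a>',
    7777: '<a href="wwwtax.cgi?id=7778">Elasmobranchii</a>',
    8292: '',
    7778: '',
}

LINK_RE = re.compile(r'href="wwwtax\.cgi\?id=(\d+)"')

def crawl(root_id):
    """Breadth-first walk over taxon pages, following only taxonomy links."""
    seen, queue, visited = {root_id}, deque([root_id]), []
    while queue:
        taxid = queue.popleft()
        visited.append(taxid)
        for match in LINK_RE.finditer(FAKE_PAGES.get(taxid, "")):
            child = int(match.group(1))
            if child not in seen:       # skip pages we've already queued
                seen.add(child)
                queue.append(child)
    return visited

print(crawl(7742))  # -> [7742, 7777, 8292, 7778]
```

Starting from the vertebrate root (taxid 7742 in the question's URL) and restricting the followed links to `wwwtax.cgi` keeps the crawl confined to the taxonomy subtree you care about; the `seen` set prevents re-visiting pages that are linked from more than one place.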