Determining the number of documents on a website with Python
I have the following link:
the reference part of the url has the following information:
A7 == The parliament (current is the seventh parliament, the former is A6 and so forth)
2010 == year
0001 == document number
For every year and parliament I would like to identify the number of documents on the website. The task is complicated by the fact that, for 2010 for instance, numbers 186, 195, and 196 have empty pages, while the maximum number is 214. Ideally, the output should be a vector of all the document numbers, excluding the missing ones.
Can anyone tell me if this is possible in Python?
Best, Thomas
Comments (3)
First, make sure that scraping their site is legal.
Second, notice that when a document is not present, the HTML file contains:
Third, use urllib to iterate over all the things you want to:
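The steps above can be sketched as follows. Since the original link and the exact empty-page text are not shown in the post, `BASE_URL` and `EMPTY_MARKER` below are placeholder assumptions you would substitute with the real values:

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

# Hypothetical URL pattern -- replace with the real reference URL.
BASE_URL = "https://example.org/doc/A7-{year}-{num:04d}"

# Text assumed to appear on empty/missing document pages; check one
# such page in a browser to find the actual marker.
EMPTY_MARKER = "No document found"

def page_is_valid(html, marker=EMPTY_MARKER):
    """A page counts as a real document if the empty-page marker is absent."""
    return marker not in html

def probe(year, num):
    """Fetch one document page; return its HTML, or None on request errors."""
    url = BASE_URL.format(year=year, num=num)
    try:
        with urlopen(url) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (HTTPError, URLError):
        return None
```

With these two helpers, iterating over candidate document numbers and testing `page_is_valid(probe(year, num))` gives the presence/absence signal the answer describes.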
Here is a solution, but adding some timeout between requests is a good idea:
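The answer's code was not preserved in this copy of the thread, so the following is a minimal sketch of the idea with a pause between requests. The `fetch` callable is an illustrative stand-in for whatever function retrieves a page (returning `None` for empty/missing documents):

```python
import time

def collect_document_numbers(fetch, max_num, delay=1.0, sleep=time.sleep):
    """Try document numbers 1..max_num and keep those whose page exists.

    `fetch(num)` should return the page HTML for a document number, or
    None when the page is empty or missing (a hypothetical interface,
    not taken from the original post).
    """
    found = []
    for num in range(1, max_num + 1):
        if fetch(num) is not None:
            found.append(num)
        sleep(delay)  # be polite: pause between requests
    return found
```

For the 2010 example in the question, `collect_document_numbers(fetch, 214)` would return all numbers up to 214 except the ones with empty pages, such as 186, 195, and 196.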
Here's a slightly more complete (but hacky) example which seems to work (using urllib2) - I'm sure you can customise it for your specific needs.
I'd also repeat Arrieta's warning about making sure the site's owner doesn't mind you scraping its content.
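The original urllib2 (Python 2) example was lost in this copy, so below is a Python 3 sketch of a "more complete" scanner in the same spirit: it walks document numbers upward and stops after a run of consecutive misses, so the maximum number need not be known in advance. The URL pattern, the empty-page marker, and the `fetch` hook are all assumptions for illustration:

```python
import time
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

# Hypothetical pattern -- adjust to match the real reference URLs.
URL_TEMPLATE = "https://example.org/doc/{parl}-{year}-{num:04d}"
EMPTY_MARKER = "No document found"  # assumed empty-page text

def _fetch(url):
    """Return page HTML, or None if the request fails outright."""
    try:
        with urlopen(url) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (HTTPError, URLError):
        return None

def scan_year(parl, year, max_missing=20, delay=0.5,
              fetch=_fetch, sleep=time.sleep):
    """Walk document numbers upward until `max_missing` consecutive
    misses, returning the numbers that resolved to real pages."""
    found, misses, num = [], 0, 1
    while misses < max_missing:
        html = fetch(URL_TEMPLATE.format(parl=parl, year=year, num=num))
        if html is not None and EMPTY_MARKER not in html:
            found.append(num)
            misses = 0  # a hit resets the consecutive-miss counter
        else:
            misses += 1
        num += 1
        sleep(delay)
    return found
```

A call such as `scan_year("A7", 2010)` would then produce the vector of existing document numbers the question asks for; `max_missing` is a heuristic cutoff, so gaps wider than it would truncate the scan early.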