Determining the number of documents on a website with Python

Posted 2024-09-08 22:19:26

I have the following link:

http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A7-2010-0001&language=EN

The reference part of the URL has the following information:

A7 == The parliament (current is the seventh parliament, the former is A6 and so forth)

2010 == year

0001 == document number
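
For illustration, a reference like A7-2010-0001 can be rebuilt from these three parts with a format string (a minimal sketch inferred from the example URL above; the variable names are only illustrative):

parliament, year, number = 7, 2010, 1
reference = "A%d-%d-%04d" % (parliament, year, number)
url = ("http://www.europarl.europa.eu/sides/getDoc.do"
       "?type=REPORT&mode=XML&reference=%s&language=EN" % reference)
print url  # ends in ...reference=A7-2010-0001&language=EN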

For every year and parliament I would like to identify the number of documents on the website. The task is complicated by the fact that, for 2010 for instance, numbers 186, 195, 196 have empty pages, while the max number is 214. Ideally the output should be a vector with all the document numbers, excluding the missing ones.

Can anyone tell me if this is possible in python?

Best, Thomas

Answers (3)

得不到的就毁灭 2024-09-15 22:19:26

First, make sure that scraping their site is legal.

Second, notice that when a document is not present, the HTML file contains:

<title>Application Error</title>

Third, use urllib to iterate over all the things you want to:

import urllib

ROOT = "http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A%d-%d-%04d&language=EN"

for p in range(1, 8):               # parliaments A1..A7
    for y in range(2000, 2011):     # years 2000..2010
        doc = 1
        while True:
            html = urllib.urlopen(ROOT % (p, y, doc)).read()
            if "Application Error" in html:
                break               # first missing number ends this year
            doc += 1

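Note that breaking at the first "Application Error" will end a year early when a number is merely skipped - for 2010 the question says 186, 195 and 196 are empty even though numbers run up to 214 - so scanning up to a fixed maximum, as in the answers below, is safer.
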
撩心不撩汉 2024-09-15 22:19:26

Here is a solution, but adding some delay between requests is a good idea:

import time
import urllib

URL_TEMPLATE = "http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A7-%d-%.4d&language=EN"
maxRange = 300

for year in [2010, 2011]:
    for page in range(1, maxRange):
        f = urllib.urlopen(URL_TEMPLATE % (year, page))
        text = f.read()
        f.close()
        # a missing document comes back as an "Application Error" page
        if "<title>Application Error</title>" in text:
            print "year %d and page %.4d NOT found" % (year, page)
        else:
            print "year %d and page %.4d FOUND" % (year, page)
        time.sleep(1)  # be polite: pause between requests
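
To get the vector of existing document numbers the question asks for, the FOUND branch can collect hits into a list instead of just printing them; a minimal sketch reusing the same template (the found name is illustrative):

import urllib

URL_TEMPLATE = "http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A7-%d-%.4d&language=EN"

found = []  # document numbers that actually exist for 2010
for page in range(1, 300):
    text = urllib.urlopen(URL_TEMPLATE % (2010, page)).read()
    if "<title>Application Error</title>" not in text:
        found.append(page)
print found
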
七秒鱼° 2024-09-15 22:19:26

Here's a slightly more complete (but hacky) example which seems to work (using httplib2) - I'm sure you can customise it for your specific needs.

I'd also repeat Arrieta's warning about making sure the site's owner doesn't mind you scraping its content.

#!/usr/bin/env python
import httplib2
h = httplib2.Http(".cache")

parliament = "A7"
year = 2010

# Create two lists, one list of URLs and one list of document numbers.
urllist = []
doclist = []

urltemplate = "http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=%s-%d-%04u&language=EN"

for document in range(1, 10000):  # document numbers start at 0001
    url = urltemplate % (parliament, year, document)
    resp, content = h.request(url, "GET")
    if content.find("Application Error") == -1:
        print "Document %04u exists" % (document)
        urllist.append(url)
        doclist.append(document)
    else:
        print "Document %04u doesn't exist" % (document)
print "Parliament %s, year %u has %u documents" % (parliament, year, len(doclist))