Determining the number of documents on a website with Python

Posted 2024-09-08 22:19:26

I have the following link:

http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A7-2010-0001&language=EN

The reference part of the URL has the following information:

A7 == The parliament (current is the seventh parliament, the former is A6 and so forth)

2010 == year

0001 == document number
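
For illustration, a reference like A7-2010-0001 can be rebuilt from these three parts with a format string (a minimal sketch inferred from the example URL above; the variable names are only illustrative):

parliament, year, number = 7, 2010, 1
reference = "A%d-%d-%04d" % (parliament, year, number)
url = ("http://www.europarl.europa.eu/sides/getDoc.do"
       "?type=REPORT&mode=XML&reference=%s&language=EN" % reference)
print url  # ends in ...reference=A7-2010-0001&language=EN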

For every year and parliament I would like to identify the number of documents on the website. The task is complicated by the fact that, for 2010 for instance, numbers 186, 195, 196 have empty pages, while the max number is 214. Ideally the output should be a vector with all the document numbers, excluding the missing ones.

Can anyone tell me if this is possible in python?

Best, Thomas

Answers (3)

得不到的就毁灭 2024-09-15 22:19:26

First, make sure that scraping their site is legal.

Second, notice that when a document is not present, the HTML file contains:

<title>Application Error</title>

Third, use urllib to iterate over all the things you want to:

import urllib

ROOT = "http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A%d-%d-%04d&language=EN"

for p in range(1, 8):               # parliaments A1..A7
    for y in range(2000, 2011):     # years 2000..2010
        doc = 1
        while True:
            html = urllib.urlopen(ROOT % (p, y, doc)).read()
            if "Application Error" in html:
                break               # first missing number ends this year
            doc += 1

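Note that breaking at the first "Application Error" will end a year early when a number is merely skipped - for 2010 the question says 186, 195 and 196 are empty even though numbers run up to 214 - so scanning up to a fixed maximum, as in the answers below, is safer.
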
撩心不撩汉 2024-09-15 22:19:26

Here is a solution, but adding some delay between requests is a good idea:

import time
import urllib

URL_TEMPLATE = "http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A7-%d-%.4d&language=EN"
maxRange = 300

for year in [2010, 2011]:
    for page in range(1, maxRange):
        f = urllib.urlopen(URL_TEMPLATE % (year, page))
        text = f.read()
        f.close()
        # a missing document comes back as an "Application Error" page
        if "<title>Application Error</title>" in text:
            print "year %d and page %.4d NOT found" % (year, page)
        else:
            print "year %d and page %.4d FOUND" % (year, page)
        time.sleep(1)  # be polite: pause between requests
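
To get the vector of existing document numbers the question asks for, the FOUND branch can collect hits into a list instead of just printing them; a minimal sketch reusing the same template (the found name is illustrative):

import urllib

URL_TEMPLATE = "http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A7-%d-%.4d&language=EN"

found = []  # document numbers that actually exist for 2010
for page in range(1, 300):
    text = urllib.urlopen(URL_TEMPLATE % (2010, page)).read()
    if "<title>Application Error</title>" not in text:
        found.append(page)
print found
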
七秒鱼° 2024-09-15 22:19:26

Here's a slightly more complete (but hacky) example which seems to work (using httplib2) - I'm sure you can customise it for your specific needs.

I'd also repeat Arrieta's warning about making sure the site's owner doesn't mind you scraping its content.

#!/usr/bin/env python
import httplib2
h = httplib2.Http(".cache")

parliament = "A7"
year = 2010

# Create two lists, one list of URLs and one list of document numbers.
urllist = []
doclist = []

urltemplate = "http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=%s-%d-%04u&language=EN"

for document in range(1, 10000):  # document numbers start at 0001
    url = urltemplate % (parliament, year, document)
    resp, content = h.request(url, "GET")
    if content.find("Application Error") == -1:
        print "Document %04u exists" % (document)
        urllist.append(url)
        doclist.append(document)
    else:
        print "Document %04u doesn't exist" % (document)
print "Parliament %s, year %u has %u documents" % (parliament, year, len(doclist))