Crawling a website with Python

Posted 2024-11-25 08:41:27

So I am looking for a dynamic way to crawl a website and grab links from each page. I decided to experiment with BeautifulSoup. Two questions: how do I do this more dynamically than using nested while statements searching for links? I want to get all the links from this site, but I don't want to keep adding nested while loops.

    topLevelLinks = self.getAllUniqueLinks(baseUrl)
    listOfLinks = list(topLevelLinks)       

    length = len(listOfLinks)
    count = 0       

    while(count < length):

        twoLevelLinks = self.getAllUniqueLinks(listOfLinks[count])
        twoListOfLinks = list(twoLevelLinks)
        twoCount = 0
        twoLength = len(twoListOfLinks)

        for twoLinks in twoListOfLinks:
            listOfLinks.append(twoLinks)

        count = count + 1

        while(twoCount < twoLength):
            threeLevelLinks = self.getAllUniqueLinks(twoListOfLinks[twoCount])  
            threeListOfLinks = list(threeLevelLinks)

            for threeLinks in threeListOfLinks:
                listOfLinks.append(threeLinks)

            twoCount = twoCount +1



    print '--------------------------------------------------------------------------------------'
    #remove all duplicates
    finalList = list(set(listOfLinks))  
    print finalList

My second question is: is there any way to tell if I got all the links from the site? Please forgive me, I am somewhat new to Python (a year or so) and I know some of my processes and logic might be childish, but I have to learn somehow. Mainly I just want to do this more dynamically than using nested while loops. Thanks in advance for any insight.

Comments (5)

冷心人i 2024-12-02 08:41:27

Spidering over a web site and getting all the links is a common problem. If you Google search for "spider web site python" you can find libraries that will do this for you. Here's one I found:

http://pypi.python.org/pypi/spider.py/0.5

Even better, Google found this question already asked and answered here on StackOverflow:

Anyone know of a good Python based web crawler that I could use?

白馒头 2024-12-02 08:41:27

If you're using BeautifulSoup, why don't you use the findAll() method? Basically, in my crawler I do:

self.soup = BeautifulSoup(HTMLcode)
for frm in self.soup.findAll(str('frame')):
    try:
        if not frm.has_key('src'):
            continue
        src = frm[str('src')]
        # rest of URL processing here
    except Exception, e:
        print 'Parser <frame> tag error: ', str(e)

for the frame tag. The same goes for the "img src" and "a href" tags (see the sketch below).
I like the topic though - maybe it's me who has something wrong here...
edit: there is of course a top-level instance, which saves the URLs and gets the HTMLcode from each link later...
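
For the "a href" case, here is a minimal sketch of the same pattern, assuming the bs4 package and an html string that was already downloaded (extract_hrefs is just an illustrative name):

    # Collect href values from anchor tags with BeautifulSoup (bs4 assumed).
    from bs4 import BeautifulSoup

    def extract_hrefs(html):
        soup = BeautifulSoup(html, 'html.parser')
        links = []
        for a in soup.findAll('a'):      # findAll and find_all are equivalent in bs4
            href = a.get('href')         # None when the tag has no href attribute
            if href:
                links.append(href)
        return links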

素罗衫 2024-12-02 08:41:27

To answer your question from the comment, here's an example (it's in ruby, but I don't know python, and they are similar enough for you to be able to follow along easily):

#!/usr/bin/env ruby

require 'open-uri'

hyperlinks = []
visited = []

# add all the hyperlinks from a url to the array of urls
def get_hyperlinks url
  links = []
  begin
    s = open(url).read
    s.scan(/(href|src)\w*=\w*[\",\']\S+[\",\']/) do
      link = $&.gsub(/((href|src)\w*=\w*[\",\']|[\",\'])/, '')
      link = url + link if link[0] == '/'

      # add to array if not already there
      links << link unless links.include?(link)
    end
  rescue
    puts 'Looks like we can\'t be here...'
  end
  links
end

print 'Enter a start URL: '
hyperlinks << gets.chomp
puts 'Off we go!'
count = 0
while true
  break if hyperlinks.length == 0
  link = hyperlinks.shift
  next if visited.include? link
  visited << link
  puts "Connecting to #{link}..."
  links = get_hyperlinks(link)
  puts "Found #{links.length} links on #{link}..."
  hyperlinks = links + hyperlinks
  puts "Moving on with #{hyperlinks.length} links left...\n\n"
end

Sorry about the Ruby, but it's a better language :P and shouldn't be hard to adapt or, like I said, understand.
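
Since the question is about Python, a rough adaptation of the same queue-plus-visited loop might look like this (a sketch only; it assumes Python 3 and the requests package, and keeps a regex as naive as the Ruby one):

    import re
    import requests

    hyperlinks = [input('Enter a start URL: ').strip()]
    visited = set()

    while hyperlinks:
        link = hyperlinks.pop(0)
        if link in visited:
            continue
        visited.add(link)
        print('Connecting to %s...' % link)
        try:
            page = requests.get(link).text
        except requests.RequestException:
            print("Looks like we can't be here...")
            continue
        # Same idea as the Ruby scan: pull href/src values with a naive regex.
        links = []
        for found in re.findall(r'(?:href|src)\s*=\s*["\']([^"\']+)["\']', page):
            if found.startswith('/'):
                found = link + found
            if found not in links:
                links.append(found)
        print('Found %d links on %s...' % (len(links), link))
        hyperlinks = links + hyperlinks
        print('Moving on with %d links left...\n' % len(hyperlinks))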

你列表最软的妹 2024-12-02 08:41:27

1) In Python, we do not count elements of a container and use them to index in; we just iterate over its elements, because that's what we want to do.

2) To handle multiple levels of links, we can use recursion.

def followAllLinks(self, from_where):
    for link in list(self.getAllUniqueLinks(from_where)):
        self.followAllLinks(link)

This does not handle cycles of links, but neither did the original approach. You can handle that by building a set of already-visited links as you go.
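
A hedged sketch of that, assuming a getAllUniqueLinks(url) method like the one in the question:

    def followAllLinks(self, from_where, visited=None):
        # Remember every URL already followed so link cycles cannot recurse forever.
        if visited is None:
            visited = set()
        if from_where in visited:
            return visited
        visited.add(from_where)
        for link in self.getAllUniqueLinks(from_where):
            self.followAllLinks(link, visited)
        return visited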

呆萌少年 2024-12-02 08:41:27

Use scrapy:

Scrapy is a fast high-level screen scraping and web crawling
framework, used to crawl websites and extract structured data from
their pages. It can be used for a wide range of purposes, from data
mining to monitoring and automated testing.
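
A minimal sketch of what such a spider could look like (the spider name and start URL are placeholders, and a reasonably recent Scrapy is assumed; run it with scrapy runspider link_spider.py):

    import scrapy

    class LinkSpider(scrapy.Spider):
        name = 'links'
        start_urls = ['http://example.com']  # placeholder start URL

        def parse(self, response):
            # Yield each link on the page as an item, then follow it so the whole
            # site gets crawled; Scrapy's duplicate filter skips repeated requests.
            for href in response.css('a::attr(href)').getall():
                yield {'link': response.urljoin(href)}
                yield response.follow(href, callback=self.parse)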
