Crawling a website with Python

Posted 2024-11-25 08:41:27

So I am looking for a dynamic way to crawl a website and grab links from each page. I decided to experiment with BeautifulSoup. Two questions: how do I do this more dynamically than using nested while statements searching for links? I want to get all the links from this site, but I don't want to keep adding nested while loops.

    topLevelLinks = self.getAllUniqueLinks(baseUrl)
    listOfLinks = list(topLevelLinks)       

    length = len(listOfLinks)
    count = 0       

    while(count < length):

        twoLevelLinks = self.getAllUniqueLinks(listOfLinks[count])
        twoListOfLinks = list(twoLevelLinks)
        twoCount = 0
        twoLength = len(twoListOfLinks)

        for twoLinks in twoListOfLinks:
            listOfLinks.append(twoLinks)

        count = count + 1

        while(twoCount < twoLength):
            threeLevelLinks = self.getAllUniqueLinks(twoListOfLinks[twoCount])  
            threeListOfLinks = list(threeLevelLinks)

            for threeLinks in threeListOfLinks:
                listOfLinks.append(threeLinks)

            twoCount = twoCount +1



    print '--------------------------------------------------------------------------------------'
    #remove all duplicates
    finalList = list(set(listOfLinks))  
    print finalList

My second question is: is there any way to tell if I got all the links from the site? Please forgive me, I am somewhat new to Python (a year or so) and I know some of my processes and logic might be childish, but I have to learn somehow. Mainly I just want to do this more dynamically than using nested while loops. Thanks in advance for any insight.

Comments (5)

冷心人i 2024-12-02 08:41:27

Spidering over a web site and getting all the links is a common problem. If you Google search for "spider web site python" you can find libraries that will do this for you. Here's one I found:

http://pypi.python.org/pypi/spider.py/0.5

Even better, Google found this question already asked and answered here on StackOverflow:

Anyone know of a good Python based web crawler that I could use?

白馒头 2024-12-02 08:41:27

If you're using BeautifulSoup, why don't you use the findAll() method? Basically, in my crawler I do:

self.soup = BeautifulSoup(HTMLcode)
for frm in self.soup.findAll(str('frame')):
    try:
        if not frm.has_key('src'):
            continue
        src = frm[str('src')]
        # rest of URL processing here
    except Exception, e:
        print 'Parser <frame> tag error: ', str(e)

for the frame tag. The same goes for the "img src" and "a href" tags (see the sketch below).
I like the topic though - maybe it's me who has something wrong here...
edit: there is of course a top-level instance, which saves the URLs and gets the HTMLcode from each link later...
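
For the "a href" case, here is a minimal sketch of the same pattern, assuming the bs4 package and an html string that was already downloaded (extract_hrefs is just an illustrative name):

    # Collect href values from anchor tags with BeautifulSoup (bs4 assumed).
    from bs4 import BeautifulSoup

    def extract_hrefs(html):
        soup = BeautifulSoup(html, 'html.parser')
        links = []
        for a in soup.findAll('a'):      # findAll and find_all are equivalent in bs4
            href = a.get('href')         # None when the tag has no href attribute
            if href:
                links.append(href)
        return links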

素罗衫 2024-12-02 08:41:27

To answer your question from the comment, here's an example (it's in ruby, but I don't know python, and they are similar enough for you to be able to follow along easily):

#!/usr/bin/env ruby

require 'open-uri'

hyperlinks = []
visited = []

# add all the hyperlinks from a url to the array of urls
def get_hyperlinks url
  links = []
  begin
    s = open(url).read
    s.scan(/(href|src)\w*=\w*[\",\']\S+[\",\']/) do
      link = $&.gsub(/((href|src)\w*=\w*[\",\']|[\",\'])/, '')
      link = url + link if link[0] == '/'

      # add to array if not already there
      links << link unless links.include?(link)
    end
  rescue
    puts 'Looks like we can\'t be here...'
  end
  links
end

print 'Enter a start URL: '
hyperlinks << gets.chomp
puts 'Off we go!'
count = 0
while true
  break if hyperlinks.length == 0
  link = hyperlinks.shift
  next if visited.include? link
  visited << link
  puts "Connecting to #{link}..."
  links = get_hyperlinks(link)
  puts "Found #{links.length} links on #{link}..."
  hyperlinks = links + hyperlinks
  puts "Moving on with #{hyperlinks.length} links left...\n\n"
end

Sorry about the Ruby, but it's a better language :P and shouldn't be hard to adapt or, like I said, understand.
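
Since the question is about Python, a rough adaptation of the same queue-plus-visited loop might look like this (a sketch only; it assumes Python 3 and the requests package, and keeps a regex as naive as the Ruby one):

    import re
    import requests

    hyperlinks = [input('Enter a start URL: ').strip()]
    visited = set()

    while hyperlinks:
        link = hyperlinks.pop(0)
        if link in visited:
            continue
        visited.add(link)
        print('Connecting to %s...' % link)
        try:
            page = requests.get(link).text
        except requests.RequestException:
            print("Looks like we can't be here...")
            continue
        # Same idea as the Ruby scan: pull href/src values with a naive regex.
        links = []
        for found in re.findall(r'(?:href|src)\s*=\s*["\']([^"\']+)["\']', page):
            if found.startswith('/'):
                found = link + found
            if found not in links:
                links.append(found)
        print('Found %d links on %s...' % (len(links), link))
        hyperlinks = links + hyperlinks
        print('Moving on with %d links left...\n' % len(hyperlinks))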

你列表最软的妹 2024-12-02 08:41:27

1) In Python, we do not count elements of a container and use them to index in; we just iterate over its elements, because that's what we want to do.

2) To handle multiple levels of links, we can use recursion.

def followAllLinks(self, from_where):
    for link in list(self.getAllUniqueLinks(from_where)):
        self.followAllLinks(link)

This does not handle cycles of links, but neither did the original approach. You can handle that by building a set of already-visited links as you go.
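
A hedged sketch of that, assuming a getAllUniqueLinks(url) method like the one in the question:

    def followAllLinks(self, from_where, visited=None):
        # Remember every URL already followed so link cycles cannot recurse forever.
        if visited is None:
            visited = set()
        if from_where in visited:
            return visited
        visited.add(from_where)
        for link in self.getAllUniqueLinks(from_where):
            self.followAllLinks(link, visited)
        return visited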

呆萌少年 2024-12-02 08:41:27

Use scrapy:

Scrapy is a fast high-level screen scraping and web crawling
framework, used to crawl websites and extract structured data from
their pages. It can be used for a wide range of purposes, from data
mining to monitoring and automated testing.
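
A minimal sketch of what such a spider could look like (the spider name and start URL are placeholders, and a reasonably recent Scrapy is assumed; run it with scrapy runspider link_spider.py):

    import scrapy

    class LinkSpider(scrapy.Spider):
        name = 'links'
        start_urls = ['http://example.com']  # placeholder start URL

        def parse(self, response):
            # Yield each link on the page as an item, then follow it so the whole
            # site gets crawled; Scrapy's duplicate filter skips repeated requests.
            for href in response.css('a::attr(href)').getall():
                yield {'link': response.urljoin(href)}
                yield response.follow(href, callback=self.parse)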
