Crawling a web site with Python
So I am looking for a dynamic way to crawl a website and grab links from each page. I decided to experiment with BeautifulSoup. Two questions: How do I do this more dynamically than using nested while statements to search for links? I want to get all the links from the site, but I don't want to keep adding nested while loops.
topLevelLinks = self.getAllUniqueLinks(baseUrl)
listOfLinks = list(topLevelLinks)

length = len(listOfLinks)
count = 0

while count < length:
    # Second level: follow every top-level link.
    twoLevelLinks = self.getAllUniqueLinks(listOfLinks[count])
    twoListOfLinks = list(twoLevelLinks)
    twoCount = 0
    twoLength = len(twoListOfLinks)

    for twoLinks in twoListOfLinks:
        listOfLinks.append(twoLinks)

    count = count + 1

    while twoCount < twoLength:
        # Third level: follow every second-level link.
        threeLevelLinks = self.getAllUniqueLinks(twoListOfLinks[twoCount])
        threeListOfLinks = list(threeLevelLinks)

        for threeLinks in threeListOfLinks:
            listOfLinks.append(threeLinks)

        twoCount = twoCount + 1

print '--------------------------------------------------------------------------------------'

# Remove all duplicates.
finalList = list(set(listOfLinks))
print finalList
My second question is: is there any way to tell whether I got all the links from the site? Please forgive me, I am somewhat new to Python (a year or so) and I know some of my processes and logic might be childish, but I have to learn somehow. Mainly I just want to do this more dynamically than with nested while loops. Thanks in advance for any insight.
Comments (5)
Spidering a web site and getting all of its links is a common problem. If you Google for "spider web site python" you can find libraries that will do this for you. Here's one I found:
http://pypi.python.org/pypi/spider.py/0.5
Even better, Google found this question already asked and answered here on StackOverflow:
Anyone know of a good Python based web crawler that I could use?
If you're using BeautifulSoup, why don't you use the findAll() method? Basically, in my crawler I do a findAll() for the frame tag, and the same goes for the "img src" and "a href" tags.
I like the topic though - maybe it's me who has something wrong here...
edit: there is of course a top-level instance, which saves the URLs and gets the HTML code from each link later...
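As a minimal sketch of that findAll() approach, assuming BeautifulSoup 4 and the requests library (the helper name get_all_unique_links is illustrative, not from the original post):

import requests
from bs4 import BeautifulSoup

def get_all_unique_links(url):
    # Fetch the page and parse it with BeautifulSoup.
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'html.parser')

    links = set()
    # Collect the targets of <a href> tags...
    for tag in soup.findAll('a', href=True):
        links.add(tag['href'])
    # ...and of <img src> and <frame src> tags.
    for tag in soup.findAll(['img', 'frame'], src=True):
        links.add(tag['src'])
    return links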
To answer your question from the comment, here's an example (it's in Ruby, but I don't know Python, and they are similar enough for you to follow along easily):
Sorry about the Ruby, but it's a better language :P and it shouldn't be hard to adapt or, like I said, understand.
1) In Python, we do not count a container's elements and use the count to index into it; we just iterate over its elements, because that's what we actually want to do.
2) To handle multiple levels of links, we can use recursion.
This does not handle cycles of links, but neither did the original approach. You can handle that by building a set of already-visited links as you go.
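A minimal sketch of that recursive idea follows; the get_all_unique_links helper is assumed (for instance, the one sketched earlier), and the depth limit and names are illustrative:

def crawl(url, depth, visited, found):
    # Stop at the depth limit, and skip pages we have already fetched
    # so that cycles of links do not loop forever.
    if depth == 0 or url in visited:
        return
    visited.add(url)

    links = get_all_unique_links(url)
    found.update(links)
    for link in links:
        crawl(link, depth - 1, visited, found)

visited, found = set(), set()
crawl(baseUrl, 3, visited, found)  # three levels, like the original code
print(sorted(found))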
Use scrapy:
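For example, here is a minimal CrawlSpider sketch, assuming Scrapy is installed; the spider name, domain, and start URL are placeholders:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class AllLinksSpider(CrawlSpider):
    name = 'all_links'
    allowed_domains = ['example.com']     # placeholder domain
    start_urls = ['http://example.com/']  # placeholder start page

    # Follow every link found, and record each visited page's URL.
    rules = (Rule(LinkExtractor(), callback='parse_item', follow=True),)

    def parse_item(self, response):
        yield {'url': response.url}

You can run it with, e.g., scrapy runspider all_links_spider.py -o links.json; Scrapy does the visited-set bookkeeping and duplicate filtering for you.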