Python: maximum recursion depth exceeded while calling a Python object
I've built a crawler that has to run on about 5M pages (by incrementing the url ID) and then parse the pages which contain the info I need.
After using an algorithm which ran over 200K of the urls and saved the good and bad results, I found that I was wasting a lot of time. I could see that there are a few recurring subtrahends which I can use to check the next valid url.
You can spot the subtrahends quite quickly (a small example of the first few "good IDs"):
510000011 # +8
510000029 # +18
510000037 # +8
510000045 # +8
510000052 # +7
510000060 # +8
510000078 # +18
510000086 # +8
510000094 # +8
510000102 # +8
510000110 # etc'
510000128
510000136
510000144
510000151
510000169
510000177
510000185
510000193
510000201
After crawling about 200K urls, which gave me only 14K good results, I knew I was wasting my time and needed to optimize, so I ran some statistics and built a function that checks the urls while increasing the id by 8\18\17\8 (the top returning subtrahends), etc.
This is the function:
import time

def checkNextID(ID):
    global numOfRuns, curRes, lastResult
    while ID < lastResult:
        try:
            numOfRuns += 1
            if numOfRuns % 10 == 0:
                time.sleep(3)  # sleep every 10 iterations
            if isValid(ID + 8):
                parseHTML(curRes)
                checkNextID(ID + 8)  # recurse one level deeper on every hit
                return 0
            if isValid(ID + 18):
                parseHTML(curRes)
                checkNextID(ID + 18)
                return 0
            if isValid(ID + 7):
                parseHTML(curRes)
                checkNextID(ID + 7)
                return 0
            if isValid(ID + 17):
                parseHTML(curRes)
                checkNextID(ID + 17)
                return 0
            if isValid(ID + 6):
                parseHTML(curRes)
                checkNextID(ID + 6)
                return 0
            if isValid(ID + 16):
                parseHTML(curRes)
                checkNextID(ID + 16)
                return 0
            else:
                checkNextID(ID + 1)  # no subtrahend matched: step by one
                return 0
        except Exception, e:
            print "somethin went wrong: " + str(e)
What it basically does: checkNextID(ID) gets the first id I know contains the data, minus 8, so the first iteration will match the first "if isValid" clause (isValid(ID + 8) will return True).
lastResult is a variable which saves the last known url id, so we'll run until ID reaches lastResult.
isValid() is a function that gets an ID + one of the subtrahends and returns True if the url contains what I need, saving a soup object of the url to a global variable named curRes; it returns False if the url doesn't contain the data I need.
parseHTML is a function that gets the soup object (curRes), parses the data I need, saves it to a csv, and then returns True.
If isValid() returns True, we'll call parseHTML() and then try to check the next ID + the subtrahends (by calling checkNextID(ID + subtrahend)); if none of them returns what I'm looking for, I'll increase the id by 1 and check again until I find the next valid url.
You can see the rest of the code here.
After running the code I got about ~950 good results, and suddenly an exception was raised:
"somethin went wrong: maximum recursion depth exceeded while calling a Python object"
I could see in Wireshark that the script was stuck on id 510009541 (I started my script with 510000003); it tried getting the url with that ID a few times before I noticed the error and stopped it.
I was really excited to see that I got the same results but 25x-40x faster than my old script, with fewer HTTP requests, and very precise: I missed only 1 result per 1000 good results, which is fine by me, since it's impossible to run 5M times. My old script ran for 30 hours and got 14-15K results, while my new script gave me ~960 results in 5-10 minutes.
I read about stack limitations, but there must be a solution for the algorithm I'm trying to implement in Python (I can't go back to my old "algorithm"; it will never end).
Thanks!
Python doesn't have great support for recursion because it lacks TRE (Tail Recursion Elimination).
This means that each call to your recursive function creates a new stack frame, and because there is a limit on stack depth (1000 by default), which you can check with sys.getrecursionlimit() (of course you can change it using sys.setrecursionlimit, but that's not recommended), your program will end up crashing when it hits this limit. Other answers have already given you a much nicer way to solve your case (replacing the recursion with a simple loop), but if you still want to use recursion, there is another solution: use one of the many recipes for implementing TRE in Python, like this one.
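For example, you can check the limit you're hitting like this (assuming a stock CPython build):

import sys

print sys.getrecursionlimit()  # 1000 by default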
N.B.: My answer is meant to give you more insight into why you get the error; I'm not advising you to use TRE, as I already explained, because in your case a loop will be much better and easier to read.
You can increase the capacity of the stack by raising the recursion limit.
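A minimal sketch of that idea (the value 10000 is arbitrary; raising the limit buys headroom but doesn't remove the deep recursion):

import sys

sys.setrecursionlimit(10000)  # default is 1000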
This turns the recursion into a loop.
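A sketch of the question's function reworked that way (same isValid, parseHTML, and globals as in the question):

import time

def checkNextID(ID):
    global numOfRuns, curRes, lastResult
    while ID < lastResult:
        try:
            numOfRuns += 1
            if numOfRuns % 10 == 0:
                time.sleep(3)  # sleep every 10 iterations
            # try the common subtrahends first, then fall back to +1
            for step in (8, 18, 7, 17, 6, 16):
                if isValid(ID + step):
                    parseHTML(curRes)
                    ID += step
                    break
            else:
                ID += 1  # no subtrahend matched: advance one id and retry
        except Exception, e:
            print "somethin went wrong: " + str(e)
    return 0

Each hit just advances ID instead of recursing, so the stack depth stays constant no matter how many pages you crawl.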
You can increase the recursion depth and thread stack size.
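A sketch of that approach (the numbers are illustrative, and main() is a hypothetical wrapper for the crawl; threading.stack_size must be set before the thread is created):

import sys
import threading

def main():
    checkNextID(510000003)  # hypothetical entry point, using the question's start id

sys.setrecursionlimit(100000)           # allow much deeper recursion
threading.stack_size(64 * 1024 * 1024)  # 64 MB stack for threads created after this call
t = threading.Thread(target=main)
t.start()
t.join()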
Instead of doing recursion, the parts of the code with checkNextID(ID + 18) and similar could be replaced with ID += 18; then, if you remove all instances of return 0, it should do the same thing but as a simple loop. You should then put a return 0 at the end and make your variables non-global.
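The shape of that change for one branch (before/after, using the question's names):

# before: recurse on every hit
if isValid(ID + 18):
    parseHTML(curRes)
    checkNextID(ID + 18)
    return 0

# after: advance the id and let the enclosing while loop continue
if isValid(ID + 18):
    parseHTML(curRes)
    ID += 18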
Use try and except, but don't print your error in the except block; just run your function again in the except statement.
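Read literally, that suggestion is something like this (safeCheckNextID is a hypothetical wrapper name; note the retry is another full call, so it won't help with a recursion-depth error):

def safeCheckNextID(ID):
    try:
        return checkNextID(ID)
    except Exception:
        return checkNextID(ID)  # retry once instead of printing the error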