How can I optimize my PageRank computation?
In the book Programming Collective Intelligence I found the following function to compute the PageRank:
def calculatepagerank(self,iterations=20):
    # clear out the current PageRank tables
    self.con.execute("drop table if exists pagerank")
    self.con.execute("create table pagerank(urlid primary key,score)")
    self.con.execute("create index prankidx on pagerank(urlid)")
    # initialize every url with a PageRank of 1.0
    self.con.execute("insert into pagerank select rowid,1.0 from urllist")
    self.dbcommit()
    for i in range(iterations):
        print "Iteration %d" % i
        for (urlid,) in self.con.execute("select rowid from urllist"):
            pr=0.15
            # Loop through all the pages that link to this one
            for (linker,) in self.con.execute("select distinct fromid from link where toid=%d" % urlid):
                # Get the PageRank of the linker
                linkingpr=self.con.execute("select score from pagerank where urlid=%d" % linker).fetchone()[0]
                # Get the total number of links from the linker
                linkingcount=self.con.execute("select count(*) from link where fromid=%d" % linker).fetchone()[0]
                pr+=0.85*(linkingpr/linkingcount)
            self.con.execute("update pagerank set score=%f where urlid=%d" % (pr,urlid))
        self.dbcommit()
However, this function is very slow because of all the SQL queries in every iteration:
>>> import cProfile
>>> cProfile.run("crawler.calculatepagerank()")
2262510 function calls in 136.006 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 136.006 136.006 <string>:1(<module>)
1 20.826 20.826 136.006 136.006 searchengine.py:179(calculatepagerank)
21 0.000 0.000 0.528 0.025 searchengine.py:27(dbcommit)
21 0.528 0.025 0.528 0.025 {method 'commit' of 'sqlite3.Connection' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1339864 112.602 0.000 112.602 0.000 {method 'execute' of 'sqlite3.Connection' objects}
922600 2.050 0.000 2.050 0.000 {method 'fetchone' of 'sqlite3.Cursor' objects}
1 0.000 0.000 0.000 0.000 {range}
So I optimized the function and came up with this:
def calculatepagerank2(self,iterations=20):
    # clear out the current PageRank tables
    self.con.execute("drop table if exists pagerank")
    self.con.execute("create table pagerank(urlid primary key,score)")
    self.con.execute("create index prankidx on pagerank(urlid)")
    # initialize every url with a PageRank of 1.0
    self.con.execute("insert into pagerank select rowid,1.0 from urllist")
    self.dbcommit()
    inlinks={}
    numoutlinks={}
    pagerank={}
    for (urlid,) in self.con.execute("select rowid from urllist"):
        inlinks[urlid]=[]
        numoutlinks[urlid]=0
        # Initialize pagerank vector with 1.0
        pagerank[urlid]=1.0
        # Loop through all the pages that link to this one
        for (inlink,) in self.con.execute("select distinct fromid from link where toid=%d" % urlid):
            inlinks[urlid].append(inlink)
        # get number of outgoing links from a page
        numoutlinks[urlid]=self.con.execute("select count(*) from link where fromid=%d" % urlid).fetchone()[0]
    for i in range(iterations):
        print "Iteration %d" % i
        for urlid in pagerank:
            pr=0.15
            for link in inlinks[urlid]:
                linkpr=pagerank[link]
                linkcount=numoutlinks[link]
                pr+=0.85*(linkpr/linkcount)
            pagerank[urlid]=pr
    for urlid in pagerank:
        self.con.execute("update pagerank set score=%f where urlid=%d" % (pagerank[urlid],urlid))
    self.dbcommit()
This function is many times faster (but uses a lot more memory for all the temporary dictionaries) because it avoids the unnecessary SQL queries in every iteration:
>>> cProfile.run("crawler.calculatepagerank2()")
90070 function calls in 3.527 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.004 0.004 3.527 3.527 <string>:1(<module>)
1 1.154 1.154 3.523 3.523 searchengine.py:207(calculatepagerank2)
2 0.000 0.000 0.058 0.029 searchengine.py:27(dbcommit)
23065 0.013 0.000 0.013 0.000 {method 'append' of 'list' objects}
2 0.058 0.029 0.058 0.029 {method 'commit' of 'sqlite3.Connection' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
43932 2.261 0.000 2.261 0.000 {method 'execute' of 'sqlite3.Connection' objects}
23065 0.037 0.000 0.037 0.000 {method 'fetchone' of 'sqlite3.Cursor' objects}
1 0.000 0.000 0.000 0.000 {range}
But is it possible to further reduce the number of SQL queries to speed up the function even more?
Update: Fixed Indentation in calculatepagerank2().
4 Answers
If you have a very large database (e.g. # records ~ # pages in the WWW), using the database in a manner similar to what's suggested in the book makes sense, because you're not going to be able to keep all that data in memory.
If your dataset is small enough, you can (probably) improve your second version by not doing so many queries. Try replacing your first loop with something like this:
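# One way to do that (a sketch; it assumes, as in the question's data,
# that every fromid and toid in link also appears in urllist):
inlinks = {}
numoutlinks = {}
pagerank = {}
for (urlid,) in self.con.execute("select rowid from urllist"):
    inlinks[urlid] = set()      # a set keeps the linking pages distinct
    numoutlinks[urlid] = 0
    pagerank[urlid] = 1.0
for fromid, toid in self.con.execute("select fromid,toid from link"):
    inlinks[toid].add(fromid)
    numoutlinks[fromid] += 1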
This version does exactly 2 queries instead of O(n^2) queries.
I believe the majority of the time is being spent on these SQL queries:
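select distinct fromid from link where toid=%d
select score from pagerank where urlid=%d
select count(*) from link where fromid=%d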
Assuming you have enough memory, you may be able to reduce this to just two queries:
SELECT fromid,toid FROM link WHERE toid IN (SELECT rowid FROM urllist)
and
SELECT fromid,count(*) FROM link WHERE fromid IN (SELECT rowid FROM urllist) GROUP BY fromid
Then you could loop through the results and build inlinks, numoutlinks and pagerank.

You may also benefit from using collections.defaultdict. The following makes inlinks a dict of sets; sets are appropriate since you only want distinct urls, for example:
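import collections

inlinks = collections.defaultdict(set)
for fromid, toid in self.con.execute(
        "SELECT fromid,toid FROM link WHERE toid IN (SELECT rowid FROM urllist)"):
    inlinks[toid].add(fromid)   # a set keeps the linking urls distinct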
And this makes pagerank a dict whose default value is 1.0 (the advantage of using collections.defaultdict is that you do not need to pre-initialize the dicts):
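pagerank = collections.defaultdict(lambda: 1.0)   # every urlid starts at 1.0 on first access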
So, put together, what I'm suggesting would look something like this:
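import collections

def calculatepagerank3(self, iterations=20):    # illustrative name
    # table setup, unchanged from the question
    self.con.execute("drop table if exists pagerank")
    self.con.execute("create table pagerank(urlid primary key,score)")
    self.con.execute("create index prankidx on pagerank(urlid)")
    self.con.execute("insert into pagerank select rowid,1.0 from urllist")
    self.dbcommit()
    # the two bulk queries from above replace all the per-url queries
    inlinks = collections.defaultdict(set)
    for fromid, toid in self.con.execute(
            "SELECT fromid,toid FROM link WHERE toid IN (SELECT rowid FROM urllist)"):
        inlinks[toid].add(fromid)
    numoutlinks = dict(self.con.execute(
        "SELECT fromid,count(*) FROM link WHERE fromid IN "
        "(SELECT rowid FROM urllist) GROUP BY fromid"))
    pagerank = collections.defaultdict(lambda: 1.0)
    for i in range(iterations):
        print "Iteration %d" % i
        for urlid in inlinks:   # urls with no inlinks keep their initial 1.0
            pr = 0.15
            for link in inlinks[urlid]:
                pr += 0.85 * (pagerank[link] / numoutlinks[link])
            pagerank[urlid] = pr
    # write all scores back in one executemany() call
    args = [(pagerank[urlid], urlid) for urlid in pagerank]
    self.con.executemany("update pagerank set score=? where urlid=?", args)
    self.dbcommit()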
Do you have enough RAM to hold the sparse matrix (fromid, toid) in some form? That would allow big optimizations (with big algorithmic changes). At least, caching in memory the (fromid, numlinks) that you now do with a select count(*) in your innermost loop should help (I'd imagine that cache, being O(N) in space if you're dealing with N URLs, would be more likely to fit in memory).
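For instance, the whole (fromid, numlinks) cache could be built with one up-front query (a sketch against the link table from the question):

# one query instead of a select count(*) per linker
numlinks = dict(self.con.execute(
    "select fromid,count(*) from link group by fromid"))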
I'm answering my own question, since in the end it came out that a mixture of all answers worked best for me:
So I replaced the first two loops as suggested by allyourcode, but in addition also used executemany() as in the solution from unutbu. But unlike unutbu, I use a generator expression for args, to not waste too much memory, although using a list comprehension was a little bit faster. In the end the routine was 100 times faster than the routine suggested in the book:
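A sketch of such a combined routine (the name calculatepagerank4 and the exact arrangement are illustrative):

def calculatepagerank4(self, iterations=20):
    # table setup as in the book's version
    self.con.execute("drop table if exists pagerank")
    self.con.execute("create table pagerank(urlid primary key,score)")
    self.con.execute("create index prankidx on pagerank(urlid)")
    self.con.execute("insert into pagerank select rowid,1.0 from urllist")
    self.dbcommit()
    # build the lookup dicts with two bulk queries (allyourcode's idea)
    inlinks = {}
    numoutlinks = {}
    pagerank = {}
    for (urlid,) in self.con.execute("select rowid from urllist"):
        inlinks[urlid] = set()
        numoutlinks[urlid] = 0
        pagerank[urlid] = 1.0
    for fromid, toid in self.con.execute("select fromid,toid from link"):
        inlinks[toid].add(fromid)
        numoutlinks[fromid] += 1
    for i in range(iterations):
        print "Iteration %d" % i
        for urlid in pagerank:
            pr = 0.15
            for link in inlinks[urlid]:
                pr += 0.85 * (pagerank[link] / numoutlinks[link])
            pagerank[urlid] = pr
    # write everything back at once; a generator expression for args
    # avoids building a large temporary list (unutbu's executemany)
    args = ((pagerank[urlid], urlid) for urlid in pagerank)
    self.con.executemany("update pagerank set score=? where urlid=?", args)
    self.dbcommit()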
One should also be aware of the following problem: if you use string formatting with %f instead of a placeholder ? for constructing the SQL statement, you will lose precision (e.g. I got 2.9796095721920315 using ? but 2.9796100000000001 using %f).
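For example (pr and urlid as in the routines above):

# placeholder binding keeps the full float precision
self.con.execute("update pagerank set score=? where urlid=?", (pr, urlid))
# %f rounds to six decimal places before the value reaches the database
self.con.execute("update pagerank set score=%f where urlid=%d" % (pr, urlid))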