Improving BeautifulSoup performance
So I have the following set of code parsing Delicious information. It prints data from a Delicious page in the following format:
Bookmark | Number of People
Bookmark | Number of People
etc...
I used to use the following method to find this info.
# Imports this snippet relies on: mechanize's Browser and BeautifulSoup 3
from mechanize import Browser
from BeautifulSoup import BeautifulSoup

def extract(soup):
    # Print every nofollow link, then every delNavCount span, as two separate lists
    links = soup.findAll('a', rel='nofollow')
    for link in links:
        print >> outfile, link['href']
    hits = soup.findAll('span', attrs={'class': 'delNavCount'})
    for hit in hits:
        print >> outfile, hit.contents

# File to export data to
outfile = open("output.txt", "w")

# Browser agent
br = Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

url = "http://www.delicious.com/asd"
page = br.open(url)
html = page.read()
soup = BeautifulSoup(html)
extract(soup)
But the problem was that some bookmarks didn't have a number of people, so I decided to parse it differently, so that I would get the data together and print out each bookmark and its number of people side by side.
EDIT: Got it down from 15 to 5 seconds with this updated version. Any more suggestions?
# Imports this snippet relies on: mechanize's Browser and BeautifulSoup 3
from mechanize import Browser
from BeautifulSoup import BeautifulSoup

def extract(soup):
    # Walk each bookmark's <div class="data"> so the link and its count stay paired
    bookmarkset = soup.findAll('div', 'data')
    for bookmark in bookmarkset:
        link = bookmark.find('a')
        vote = bookmark.find('span', 'delNavCount')
        try:
            print >> outfile, link['href'], " | ", vote.contents
        except (AttributeError, TypeError):
            # No count span (or no link) for this bookmark
            print >> outfile, "[u'0']"
    #print bookmarkset

# File to export data to
outfile = open("output.txt", "w")

# Browser agent
br = Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

url = "http://www.delicious.com/asd"
page = br.open(url)
html = page.read()
soup = BeautifulSoup(html)
extract(soup)
The performance of this is terrible though: it takes 17 seconds to parse the first page, and around 15 seconds per page thereafter, on a pretty decent machine. It degraded significantly when going from the first bit of code to the second. Is there anything I can do to improve performance here?
3 Answers
I don't understand why you are assigning to vote twice. The first assignment is unnecessary and indeed quite heavy, since it must parse the whole document on each iteration. Why?
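The question's updated code only assigns vote once, so this presumably refers to an earlier revision. As a hypothetical sketch of the pattern being criticised (not code from the question): searching from soup inside the loop re-walks the whole parsed document on every iteration, while searching from bookmark only touches that subtree.

# Hypothetical reconstruction of the heavy pattern: the first assignment
# scans the entire document on every pass and is immediately overwritten.
for bookmark in soup.findAll('div', 'data'):
    vote = soup.find('span', 'delNavCount')
    vote = bookmark.find('span', 'delNavCount')

# Only the lookup scoped to the current bookmark's subtree is needed:
for bookmark in soup.findAll('div', 'data'):
    vote = bookmark.find('span', 'delNavCount')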
If you are concerned about performance you might have a look at something that talks to the Delicious API rather than screen-scraping, i.e. pydelicious. For example:
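The example itself didn't survive here; a rough sketch of the idea, assuming pydelicious's module-level get_userposts helper (the key names in the returned dicts vary between versions, so treat them as placeholders):

import pydelicious

# Fetch the user's public bookmarks through the Delicious API instead of
# scraping the HTML page.
posts = pydelicious.get_userposts('asd')
for post in posts:
    # Inspect one post dict first; key names like 'href' and 'description'
    # are assumptions and differ between pydelicious versions.
    print post.get('href'), '|', post.get('description')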
BeautifulSoup does a lot more than what you need in this instance. If you really want to crank up the speed, then I would suggest taking a more basic approach of urllib + a simple line-by-line parser.
Parsing a page the size of the "asd" example should take well under one second on a modern machine.
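A minimal sketch of that approach, in Python 2 to match the question; the regular expressions are guesses based on the markup the question's code implies (rel="nofollow" links and class="delNavCount" spans), not the real Delicious page structure:

import re
import urllib2

# Patterns are assumptions about the markup; check them against the actual HTML.
href_re = re.compile(r'href="([^"]+)"')
count_re = re.compile(r'class="delNavCount"[^>]*>([^<]*)<')

html = urllib2.urlopen("http://www.delicious.com/asd").read()

outfile = open("output.txt", "w")
current = None
for line in html.splitlines():
    # Remember the most recent bookmark link, then pair it with the next count seen.
    if 'rel="nofollow"' in line:
        m = href_re.search(line)
        if m:
            current = m.group(1)
    elif 'delNavCount' in line:
        m = count_re.search(line)
        if m and current is not None:
            print >> outfile, current, "|", m.group(1)
            current = None
outfile.close()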