Improving BeautifulSoup performance

So I have the following set of code that parses Delicious information. It prints data from a Delicious page in the following format:

Bookmark | Number of People

Bookmark | Number of People
etc...

I used to use the following method to find this info.

from mechanize import Browser
from BeautifulSoup import BeautifulSoup

def extract(soup):
    # Print every bookmark link, then every hit count. The two lists are
    # written one after the other, so links and counts never line up.
    links = soup.findAll('a', rel='nofollow')
    for link in links:
        print >> outfile, link['href']

    hits = soup.findAll('span', attrs={'class': 'delNavCount'})
    for hit in hits:
        print >> outfile, hit.contents

# File to export data to
outfile = open("output.txt", "w")

# Browser agent
br = Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

url = "http://www.delicious.com/asd"
page = br.open(url)
html = page.read()
soup = BeautifulSoup(html)
extract(soup)

But the problem was that some bookmarks didn't have a number of people, so I decided to parse the page differently so that I would get both pieces of data together and print each bookmark and its number of people side by side.

EDIT: Got it down from 15 seconds to 5 with this updated version. Any more suggestions?

from mechanize import Browser
from BeautifulSoup import BeautifulSoup

def extract(soup):
    # Each bookmark sits in its own <div class="data">, so search within
    # that div to keep each link paired with its count.
    bookmarkset = soup.findAll('div', 'data')
    for bookmark in bookmarkset:
        link = bookmark.find('a')
        vote = bookmark.find('span', 'delNavCount')
        try:
            print >> outfile, link['href'], " | ", vote.contents
        except (AttributeError, TypeError):
            # No delNavCount span (or no link) for this bookmark: report zero.
            print >> outfile, "[u'0']"

# File to export data to
outfile = open("output.txt", "w")

# Browser agent
br = Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

url = "http://www.delicious.com/asd"
page = br.open(url)
html = page.read()
soup = BeautifulSoup(html)
extract(soup)

The performance on this is terrible, though: it takes 17 seconds to parse the first page and around 15 seconds thereafter, on a pretty decent machine. It degraded significantly when going from the first bit of code to the second. Is there anything I can do to improve performance here?


Comments (3)

演多会厌 2024-10-15 11:09:06

I don't understand why you are assigning to vote twice. The first assignment is unnecessary and indeed quite heavy, since it must re-parse the whole document on each iteration. Why?

   vote = BeautifulSoup(html)
   vote = bookmark.findAll('span', attrs={'class': 'delNavCount'})
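
Dropping that first line leaves only the cheap, per-bookmark lookup. A minimal sketch of the corrected loop (assuming bookmarkset holds the parsed bookmark divs, as in the question's updated code):

for bookmark in bookmarkset:
    # Search only inside the current bookmark div -- no re-parsing of html
    vote = bookmark.findAll('span', attrs={'class': 'delNavCount'})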
椒妓 2024-10-15 11:09:06

If you are concerned about performance, you might have a look at something that talks to the Delicious API rather than screen-scraping, i.e. pydelicious. For example:

>>> import pydelicious
>>> pydelicious.get_userposts('asd')
[{'extended': '', 'description': u'Ksplice - System administration and software blog', 'tags': u'sysadm, blog, interesting', 'url': u'http://blog.ksplice.com/', 'user': u'asd'
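
Since each post comes back as a plain dict, producing the bookmark list is then just a key lookup instead of HTML parsing. A small sketch (assuming every returned post carries a 'url' key, as in the sample output above):

import pydelicious

# Fetch the user's posts once, then read URLs straight out of the dicts.
for post in pydelicious.get_userposts('asd'):
    print post['url']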
南…巷孤猫 2024-10-15 11:09:06

BeautifulSoup does a lot more than you need in this instance. If you really want to crank up the speed, then I would suggest taking a more basic approach: urllib plus a simple line-by-line parser.

Parsing a page the size of the "asd" example should take well under one second on a modern machine.
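
A minimal Python 2 sketch of that approach. The regexes here are hypothetical: they assume the markup matches what the question's code searches for (an <a ... href="..." ... rel="nofollow"> link followed by a <span class="delNavCount">N</span> count, each on its own line), so adjust them to the real page source:

import re
import urllib2

# These patterns assume a fixed attribute order; tweak to the actual markup.
link_re = re.compile(r'<a[^>]*href="([^"]+)"[^>]*rel="nofollow"')
count_re = re.compile(r'<span class="delNavCount"[^>]*>([^<]*)</span>')

req = urllib2.Request("http://www.delicious.com/asd",
                      headers={'User-agent': 'Mozilla/5.0'})
html = urllib2.urlopen(req).read()

href = None
for line in html.splitlines():
    m = link_re.search(line)
    if m:
        href = m.group(1)             # remember the most recent bookmark link
        continue
    m = count_re.search(line)
    if m and href is not None:
        print href, "|", m.group(1)   # pair the count with that link
        href = None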
