How to get a favicon using Beautiful Soup and Python
I wrote some throwaway code just for learning, but it doesn't work for most sites.
Here is the code:
import urllib2, re
from BeautifulSoup import BeautifulSoup as Soup

class Founder:
    def Find_all_links(self, url):
        page_source = urllib2.urlopen(url)
        a = page_source.read()
        soup = Soup(a)
        a = soup.findAll(href=re.compile(r'/.a\w+'))
        return a

    def Find_shortcut_icon (self, url):
        a = self.Find_all_links(url)
        b = ''
        for i in a:
            strre=re.compile('shortcut icon', re.IGNORECASE)
            m=strre.search(str(i))
            if m:
                b = i["href"]
        return b

    def Save_icon(self, url):
        url = self.Find_shortcut_icon(url)
        print url
        host = re.search(r'[0-9a-zA-Z]{1,20}\.[a-zA-Z]{2,4}', url).group()
        opener = urllib2.build_opener()
        icon = opener.open(url).read()
        file = open(host+'.ico', "wb")
        file.write(icon)
        file.close()
        print '%s icon successfully saved' % host

c = Founder()
print c.Save_icon('http://lala.ru')
The strangest thing is that it works for these sites:
http://habrahabr.ru
http://5pd.ru
But it doesn't work for most of the others I've checked.
Comments (5)
You're making it far more complicated than it needs to be. Here's a simple way to do it:
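A minimal sketch of that simpler approach, assuming Python 3 and the `bs4` package rather than the Python 2 `BeautifulSoup` module used in the question. The function names `find_shortcut_icon` and `save_icon` and the default file name are made up for illustration. It asks Beautiful Soup directly for the `<link rel="shortcut icon">` tag instead of filtering every `href` with a regex, and resolves a relative `href` against the page URL:

```python
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup


def find_shortcut_icon(html, page_url):
    """Return the absolute URL of the rel="shortcut icon" link, or None."""
    soup = BeautifulSoup(html, "html.parser")
    icon_link = soup.find("link", rel="shortcut icon")
    if icon_link is None or not icon_link.get("href"):
        return None
    # The href is often relative ("/favicon.ico"), so resolve it
    # against the URL of the page it came from.
    return urljoin(page_url, icon_link["href"])


def save_icon(page_url, filename="favicon.ico"):
    """Download the page, locate its favicon and write it to filename."""
    with urlopen(page_url) as response:
        html = response.read()
    icon_url = find_shortcut_icon(html, page_url)
    if icon_url is None:
        return None
    with urlopen(icon_url) as icon, open(filename, "wb") as out:
        out.write(icon.read())
    return icon_url
```

Calling `save_icon("http://habrahabr.ru")` would then write the icon to `favicon.ico`, provided the page declares one with exactly that `rel` value.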
Thomas K's answer got me started in the right direction, but I found some websites that didn't say rel="shortcut icon", like 1800contacts.com that says just rel="icon". This works in Python 3 and returns the link. You can write that to file if you want.
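Something along these lines would cover both spellings. This is only a sketch, assuming Python 3 with `bs4`. The name `find_icon_url` is hypothetical. It looks at the individual tokens of the `rel` attribute, so `rel="icon"` and `rel="shortcut icon"` are both accepted, and it returns the link as an absolute URL:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def find_icon_url(html, page_url):
    """Return the first favicon URL declared with rel="icon" or rel="shortcut icon"."""
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.find_all("link", href=True):
        rel = link.get("rel") or []
        # bs4 normally returns rel as a list of tokens; tolerate a plain string too.
        tokens = rel.split() if isinstance(rel, str) else rel
        if "icon" in (token.lower() for token in tokens):
            return urljoin(page_url, link["href"])
    return None
```

Because the comparison is made per token, a value such as `apple-touch-icon` is not mistaken for a favicon link.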
In case anyone wants to use a single check with regex, the following works for me:
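A sketch of such a single regex check, assuming Python 3 with `bs4`. The names `ICON_REL` and `icon_hrefs` are made up for illustration. One compiled pattern is passed as the `rel` filter to `find_all`, and the `re.IGNORECASE` flag makes it accept `ICON`, `Shortcut Icon` and similar variants:

```python
import re

from bs4 import BeautifulSoup

# Matches rel="icon" or rel="shortcut icon" in any letter case.
ICON_REL = re.compile(r"^(shortcut icon|icon)$", re.IGNORECASE)


def icon_hrefs(html):
    """Return the href of every <link> whose rel attribute matches ICON_REL."""
    soup = BeautifulSoup(html, "html.parser")
    links = soup.find_all("link", attrs={"rel": ICON_REL})
    return [link.get("href") for link in links]
```

The returned values are the raw `href` strings, so relative paths still need to be joined with the page URL before downloading.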
This also handles differences in letter case.
Thank you, kurd. Here is the code with some changes:
Thank you, Thomas.
Here is the code with some changes: