How can I scrape a logo from a website?
First off, this is not a question about how to scrape websites. I am fully aware of the tools available to me to scrape (css_parser, nokogiri, etc. I'm using Ruby to do the scraping).
This is more of an overarching question on the best possible solution to scrape the logo of a website starting with nothing but a website address.
The two solutions I've begun to create are these:
- Use Google AJAX APIs to do an image search that is scoped to the site in question, with the query "logo", and grab the first result. This gets the logo, I'd say, about 30% of the time.
- The problem with the above is that Google doesn't really seem to care about CSS image-replaced logos (i.e., H1 text that is replaced with the logo image via CSS). The solution I've tentatively come up with is to pull down all CSS files, scan for url() declarations, and then look for the words header or logo in the file names.
Solution two is problematic because of the many idiosyncrasies of all the people who write CSS for websites. They use Header instead of logo in the file name. Sometimes the file name is random, saying nothing about a logo. Other times, it's just the wrong image.
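For illustration, the CSS scan described in solution two could be sketched in plain Ruby roughly as follows. This is a minimal sketch, not a tested scraper: the method name `logo_candidates_from_css` and the default keyword list are made up for the example, and it assumes the CSS text has already been downloaded. Matching is case-insensitive, so "Header" and "header" are treated alike.

```ruby
# Returns the url() values found in `css` whose file name contains one of
# the keywords. Method name and keyword list are hypothetical.
def logo_candidates_from_css(css, keywords: %w[logo header])
  # url(...) with optional single or double quotes around the value.
  url_pattern = /url\(\s*(['"]?)(.*?)\1\s*\)/i

  css.scan(url_pattern)
     .map { |_quote, value| value.strip }
     .reject { |value| value.empty? || value.start_with?("data:") }
     .select do |value|
       # Look only at the file name, ignoring any query string or fragment.
       file_name = File.basename(value.split(/[?#]/).first.to_s).downcase
       keywords.any? { |keyword| file_name.include?(keyword) }
     end
     .uniq
end

sample_css = <<~CSS
  h1 a { background: url("/img/site-Header.png") no-repeat; }
  .hero { background-image: url(/img/banner.jpg); }
  #brand { background: url('../images/logo.gif?v=2'); }
CSS

puts logo_candidates_from_css(sample_css)
```

The returned values are still relative to the stylesheet they came from, so they would need to be resolved against that stylesheet's URL before being fetched. A keyword filter like this cannot help with randomly named files or with files that match but are the wrong image.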
I realize I might be able to do something with some sort of machine learning, but I'm on a bit of a deadline for a client and need something fairly capable soon.
So with all that said, if anyone has any "out of the box" thinking on this one, I'd love to hear it. If I can create a solution that works well enough, I plan on open-sourcing the library for any other interested parties :)
Thanks!
Comments (5)
Check out this API by Clearbit. It's super simple to use:
Just send a query to:
https://logo.clearbit.com/[enter-domain-here]
For example:
https://logo.clearbit.com/www.stackoverflow.com
and get back the logo image!
More about it here
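Since the question starts from nothing but a website address, a small helper could turn that address into the URL pattern shown above. This is only a sketch: the method name `clearbit_logo_url` is hypothetical, and it assumes the endpoint behaves as this answer describes. It builds the URL without making any request.

```ruby
require "uri"

# Builds the logo URL for a site address, following the pattern
# https://logo.clearbit.com/[domain] given in the answer above.
# The method name is hypothetical.
def clearbit_logo_url(site)
  address = site.strip
  # Add a scheme when the input is a bare domain so URI can find the host.
  address = "http://#{address}" unless address.match?(%r{\A[a-z][a-z0-9+.-]*://}i)

  host = URI.parse(address).host
  raise ArgumentError, "no host in #{site.inspect}" if host.nil? || host.empty?

  "https://logo.clearbit.com/#{host.downcase}"
end

puts clearbit_logo_url("www.stackoverflow.com")
```

The resulting URL can then be fetched with whatever HTTP client the scraper already uses.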
I had to find logos for ~10K websites for a previous project and tried the same technique you mentioned of extracting the image with "logo" in the URL. My variation was that I loaded each webpage in WebKit, so that images loaded from CSS or JavaScript were included as well. This technique gave me logos for ~40% of the websites.
Then I considered creating an app, like Nick suggested, to manually select the logo for the remaining websites. However, I realized it was more cost-effective to just give these to someone cheap (whom I found via Elance) to do the work manually.
So I suggest not bothering to solve this properly with a fully technical solution: outsource the manual labour.
Creating an application will definitely help you, but I believe in the end there will be some manual work involved. Here's what I would do.
Even if it were possible to write an application that truly figures out whether an image is a logo, it seems like that would take a massive amount of code. In the end, it would probably weed out even more than the above, but you have to take into account that it could be faster for a human to visually parse the results than for you to write and test the complex code.
Yet another simple way to solve this problem is to get all leaf nodes and get the first
You can look online for projects that extract HTML leaf nodes, or use regular expressions to get all HTML tags.
I used a C# console app with the HtmlAgilityPack NuGet package to scrape logos from over 600 sites.
The algorithm is to collect all images that have "logo" in their URL.
The challenges you will face during such extraction are:
protocol before you make a request)
string at the end
With those things in mind, I got approximately a 70% success rate, but some of the images were not actual logos.
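The same filter can be sketched in Ruby, the asker's language. This is not the C# code from this answer: it swaps HtmlAgilityPack for a crude regular expression so the example needs only the standard library, and the method name `logo_image_urls` is hypothetical. A real HTML parser such as nokogiri would be more robust. Resolving each `src` against the page URL with `URI.join` is my own addition, so that relative and scheme-less references become absolute URLs that can be requested.

```ruby
require "uri"

# Returns absolute URLs of <img> sources whose URL contains "logo".
# The method name is hypothetical; the regex is a rough stand-in for a parser.
def logo_image_urls(html, page_url)
  html.scan(/<img\b[^>]*?\bsrc\s*=\s*["']([^"']+)["']/i)
      .flatten
      .select { |src| src.downcase.include?("logo") }
      .map do |src|
        begin
          # Resolves relative paths and //host/path references.
          URI.join(page_url, src.strip).to_s
        rescue URI::Error
          nil # Skip values that cannot be parsed as a URL.
        end
      end
      .compact
      .uniq
end

sample_html = <<~HTML
  <img src="//cdn.example.com/logo.png?v=3">
  <img src="/img/photo.jpg">
  <img alt="Example" src='img/Site-Logo.svg'>
HTML

puts logo_image_urls(sample_html, "https://example.com/")
```

As the answer notes, a match on "logo" in the URL does not guarantee the image is the site's actual logo, so the results still need checking.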