How do I scrape a logo from a website?

Posted on 2024-10-31 06:05:12

First off, this is not a question about how to scrape websites. I am fully aware of the tools available to me to scrape (css_parser, nokogiri, etc. I'm using Ruby to do the scraping).

This is more of an overarching question on the best possible solution to scrape the logo of a website starting with nothing but a website address.

The two solutions I've begun to create are these:

  1. Use Google AJAX APIs to do an image search that is scoped to the site in question, with the query "logo", and grab the first result. This gets the logo, I'd say, about 30% of the time.
  2. The problem with the above is that Google doesn't really seem to care about CSS image-replaced logos (i.e., H1 text that is replaced by the logo image). The solution I've tentatively come up with is to pull down all CSS files, scan for url() declarations, and then look for the words "header" or "logo" in the file names.
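
The second approach can be sketched in plain Ruby as follows. This is only a minimal illustration, assuming the CSS text has already been downloaded; `LOGO_HINT`, `CSS_URL`, and `logo_candidates` are made-up names, and the regular expression is a simplification that does not handle every legal form of `url()`.

```ruby
# Sketch: pull url(...) references out of CSS text and keep the ones whose
# file name hints at a logo. All names here are illustrative.
LOGO_HINT = /logo|header/i

# Matches url(foo.png), url('foo.png') and url("foo.png").
CSS_URL = /url\(\s*['"]?([^'")\s]+)['"]?\s*\)/i

def logo_candidates(css)
  css.scan(CSS_URL).flatten.select do |ref|
    # Test only the file name (query string removed), so that a directory
    # such as /headers/ does not cause a match on its own.
    File.basename(ref.sub(/[?#].*\z/, "")) =~ LOGO_HINT
  end
end

css = <<~CSS
  h1 a { background: url("/img/site-logo.png") no-repeat; }
  body { background: url(/img/bg.jpg); }
  #top { background: url('../images/Header.gif?v=2'); }
CSS

p logo_candidates(css)
# → ["/img/site-logo.png", "../images/Header.gif?v=2"]
```

The returned references are still relative to the stylesheet they came from, so they would need to be resolved against the stylesheet's URL before being fetched.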

Solution two is problematic because of the many idiosyncrasies of all the people who write CSS for websites. They use Header instead of logo in the file name. Sometimes the file name is random, saying nothing about a logo. Other times, it's just the wrong image.

I realize I might be able to do something with some sort of machine learning, but I'm on a bit of a deadline for a client and need something fairly capable soon.

So with all that said, if anyone has any "out of the box" thinking on this one, I'd love to hear it. If I can create a solution that works well enough, I plan on open-sourcing the library for any other interested parties :)

Thanks!


Comments (5)

堇色安年 2024-11-07 06:05:12

Check this API by Clearbit. It's super simple to use:

Just send a query to:
https://logo.clearbit.com/[enter-domain-here]

For example:
https://logo.clearbit.com/www.stackoverflow.com

and get back the logo image!
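
A small Ruby helper for building that URL from whatever address you start with might look like this. It is a sketch that assumes the endpoint behaves as described in this answer; `clearbit_logo_url` is a made-up name, and no request is actually sent here.

```ruby
require "uri"

# Endpoint as given in the answer above; the helper name is illustrative.
CLEARBIT_LOGO_BASE = "https://logo.clearbit.com/"

# Accepts either a bare domain or a full address and returns the logo URL.
def clearbit_logo_url(site)
  # Add a scheme when there is none, so that URI.parse can find the host.
  site = "http://#{site}" unless site =~ %r{\A[a-z][a-z0-9+.-]*://}i
  host = URI.parse(site).host
  raise ArgumentError, "no host in #{site.inspect}" if host.nil? || host.empty?
  CLEARBIT_LOGO_BASE + host.downcase
end

puts clearbit_logo_url("www.stackoverflow.com")
# → https://logo.clearbit.com/www.stackoverflow.com
puts clearbit_logo_url("https://Example.com/about?x=1")
# → https://logo.clearbit.com/example.com
```

Fetching the resulting URL (for example with `Net::HTTP`) and checking the response status would tell you whether a logo was actually found for that domain.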

More about it here

顾北清歌寒 2024-11-07 06:05:12

I had to find logos for ~10K websites for a previous project and tried the same technique you mentioned of extracting the image with "logo" in the URL. My variation was I loaded each webpage in webkit so that all images were loaded from CSS or JavaScript. This technique gave me logos for ~40% of websites.

Then I considered creating an app, as Nick suggested, to manually select the logo for the remaining websites; however, I realized it was more cost-effective to just give these to someone cheap (whom I found via Elance) to do the work manually.

So I suggest you don't bother solving this properly with a fully technical solution - outsource the manual labour.

榆西 2024-11-07 06:05:12

Creating an application will definitely help you, but I believe in the end there will be some manual work involved. Here's what I would do.

  • Have your application store in a database a link to all images on a website that are larger than a specified dimension so that you can weed out small icons.
  • Then you can set up a form to access these results. You may want to set up the database table to store the website URL and the relationship between the URL and the image links.
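
The size filter in the first step could be sketched like this in Ruby. It is a minimal illustration that handles PNG only, by reading the width and height from the file header; `png_dimensions`, `large_enough?`, and the thresholds are made-up for the example, and a real crawler would need to cover JPEG, GIF, and SVG as well.

```ruby
# Sketch: read width and height from the header of a PNG so that small icons
# can be dropped. Only PNG is handled; anything else returns nil.
PNG_SIGNATURE = "\x89PNG\r\n\x1A\n".b

def png_dimensions(bytes)
  bytes = bytes.b
  return nil unless bytes.bytesize >= 24 && bytes.start_with?(PNG_SIGNATURE)
  return nil unless bytes.byteslice(12, 4) == "IHDR".b
  # Width and height are two big-endian 32-bit integers right after "IHDR".
  bytes.byteslice(16, 8).unpack("N2")
end

# min_width and min_height are arbitrary illustrative thresholds.
def large_enough?(bytes, min_width: 64, min_height: 32)
  width, height = png_dimensions(bytes)
  !width.nil? && width >= min_width && height >= min_height
end

# A fake header for a 200x60 image (no pixel data), just to exercise the code.
header = PNG_SIGNATURE + [13].pack("N") + "IHDR".b + [200, 60].pack("N2")
p png_dimensions(header)  # → [200, 60]
p large_enough?(header)   # → true
```

Only the first 24 bytes of each image are needed for this check, so the full file does not have to be downloaded before deciding whether to store the link.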

Even if it were possible to write an application that could truly figure out whether an image is a logo or not, it seems like it would take a massive amount of code. In the end, it would probably weed out even more than the approach above, but you have to take into account that it could be faster for a human to visually parse the results than for you to write and test the complex code.

风尘浪孓 2024-11-07 06:05:12

Yet another simple way to solve this problem is to get all leaf nodes and take the first one of this form:

<a><img src="http://example.com/a/file.png" /></a>

You can look up projects on the net that extract HTML leaf nodes, or use regular expressions to get all HTML tags.
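
As a rough illustration of the regular-expression variant in Ruby, the sketch below returns the `src` of the first `<img>` that sits directly inside an `<a>`. `LINKED_IMG` and `first_linked_image` are made-up names, and the pattern will miss unusual markup, so an HTML parser such as nokogiri is likely the safer choice for real pages.

```ruby
# Sketch: return the src of the first <img> that directly follows an opening
# <a ...> tag. Names are illustrative; the pattern is deliberately simple.
LINKED_IMG = /<a\b[^>]*>\s*<img\b[^>]*?\bsrc\s*=\s*["']([^"']+)["']/im

def first_linked_image(html)
  match = LINKED_IMG.match(html)
  match && match[1]
end

html = <<~HTML
  <div><img src="/spacer.gif"></div>
  <a href="/"><img src="http://example.com/a/file.png" /></a>
  <a href="/ads"><img src="/banner.png"></a>
HTML

puts first_linked_image(html)
# → http://example.com/a/file.png
```

The unlinked spacer image is skipped because it is not wrapped in an `<a>`; whether the first linked image really is the logo still depends on how the page is laid out.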

酒解孤独 2024-11-07 06:05:12

I used a C# console app with the HtmlAgilityPack NuGet package to scrape logos from over 600 sites.
The algorithm is to get all images that have "logo" in the URL.
The challenges you will face during such extraction are:

  • Relative image URLs
  • The base URL may be a CDN, over HTTP or HTTPS (if you don't know
    the protocol before you make a request)
  • Image URLs may have a ? or & with a query
    string at the end
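
The answer used C#; the same three issues can be sketched in Ruby with the standard `uri` library. This is a minimal illustration, with `absolute_image_url` and `logo_in_path?` as made-up names and `example.com` as placeholder data.

```ruby
require "uri"

# Sketch: turn an image reference found on a page into an absolute URL.
def absolute_image_url(page_url, src)
  # URI.join resolves relative paths ("../img/logo.png") as well as
  # protocol-relative references ("//cdn.example.com/logo.png"), which take
  # the scheme of the page they were found on.
  URI.join(page_url, src.strip).to_s
end

# Look for "logo" in the path only, ignoring any ?query or #fragment.
def logo_in_path?(url)
  URI.parse(url).path.to_s.downcase.include?("logo")
end

page = "https://www.example.com/about/index.html"
puts absolute_image_url(page, "../img/logo.png?v=3&size=2x")
# → https://www.example.com/img/logo.png?v=3&size=2x
puts absolute_image_url(page, "//cdn.example.com/assets/Logo.svg")
# → https://cdn.example.com/assets/Logo.svg
puts logo_in_path?("https://www.example.com/banner.png?ref=logo")
# → false
```

References containing characters that are not valid in a URL (such as unescaped spaces) raise `URI::InvalidURIError`, so a real crawler would want to rescue that and skip or escape the offending value.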

With those things in mind, I got roughly a 70% success rate, but some of the images were not actual logos.
