You can try PublicWWW to search page source/mark-up. It lets you find any HTML, JavaScript, CSS, or plain-text snippet in the page source code of 167+ million websites.
With PublicWWW you can:
Find related websites through the unique HTML code they share, e.g. widgets & publisher IDs.
Identify sites using certain images or badges.
Find out who else is using your theme.
Identify sites mentioning you.
Find your competitor's affiliates.
Identify sites where your competitors personally collaborate or interact.
Find references to a library or a platform.
Find code examples on the net.
Figure out who is using what JS widgets on their sites.
...
Of course, you can find more than just your own websites that use a given code/mark-up snippet.
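If you want to script such lookups, PublicWWW also offers a CSV export of search results. The endpoint shape below is an assumption (check PublicWWW's own API docs, and note that real use requires an account/API key); the sketch only shows how to build a properly encoded query URL for an exact source-code snippet:

```python
import urllib.parse

# Assumed CSV-export endpoint shape; verify against PublicWWW's API docs.
PUBLICWWW_EXPORT = "https://publicwww.com/websites/{query}/?export=csv"

def build_search_url(snippet: str) -> str:
    # PublicWWW treats a quoted snippet as an exact source-code match;
    # the query must be percent-encoded to survive as a URL path segment.
    quoted = urllib.parse.quote(f'"{snippet}"', safe="")
    return PUBLICWWW_EXPORT.format(query=quoted)
```

For example, `build_search_url("ua-12345")` produces a URL searching for the exact string `"ua-12345"` (a hypothetical publisher ID) in page source.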
I've come across the following resources on my travels (some already mentioned above):
HTML Mark-up-focused search engines
I'd also like to throw in the following:
Huge website-crawl data archives
How can we analyze this crawl data?
For an idea of how to begin analyzing some of this massive data, take a look at Big Data / map-reduce-type framework(s).
Google lists some ideas on using Apache's Spark project to analyze Common Crawl's dump(s). To understand the file format(s) used by Common Crawl, refer to the following:
The article, Accessing-Common-Crawl-Dataset-on-S3, outlines accessing Common Crawl's 250TB+ dump(s) in a low-cost manner, without transferring that data load outside of Amazon's AWS/S3 network. Of course, that assumes you are going to use some combination of AWS/EC2/S3, etc. to analyze the crawl data.
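If a full AWS pipeline is overkill, individual records can also be pulled over plain HTTPS: Common Crawl's CDX index reports a `filename`, `offset`, and `length` for every capture, and that byte range can be fetched from `data.commoncrawl.org` with an HTTP Range request. A minimal sketch (the filename in the test/usage is hypothetical):

```python
def cc_range_request(filename, offset, length):
    """Build the URL and Range header needed to fetch one gzipped WARC
    record from Common Crawl's public HTTPS mirror, using the
    offset/length values the CDX index reports for a capture."""
    url = "https://data.commoncrawl.org/" + filename
    # HTTP byte ranges are inclusive on both ends.
    return url, {"Range": f"bytes={offset}-{offset + length - 1}"}
```

Pass the returned headers to any HTTP client (e.g. `urllib.request.Request(url, headers=headers)`); the response body is a standalone gzip member you can decompress and read as a single WARC record.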
Finally, Patrick Durusau maintains some interesting Common-Crawl-usage-related blog pages.
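To get a feel for the WARC container format those dumps use: each record is a `WARC/1.0` version line, CRLF-delimited headers, a blank line, then exactly `Content-Length` bytes of payload, followed by a blank-line separator. A minimal stdlib-only reader for uncompressed records (real Common Crawl files are gzipped, one member per record; use `gzip` or the `warcio` library for those):

```python
import io

def read_warc_record(stream):
    """Read one uncompressed WARC record from a binary stream; returns
    (headers, payload) or None at end of input."""
    version = stream.readline()
    if not version.startswith(b"WARC/"):
        return None
    headers = {}
    for line in iter(stream.readline, b"\r\n"):  # headers end at a blank line
        name, _, value = line.decode("utf-8").partition(":")
        headers[name.strip()] = value.strip()
    payload = stream.read(int(headers["Content-Length"]))
    stream.read(4)  # consume the trailing \r\n\r\n record separator
    return headers, payload

# Demo on a hand-built record:
raw = (b"WARC/1.0\r\n"
       b"WARC-Type: response\r\n"
       b"WARC-Target-URI: http://example.com/\r\n"
       b"Content-Length: 5\r\n"
       b"\r\n"
       b"hello\r\n\r\n")
headers, payload = read_warc_record(io.BytesIO(raw))
```

This is a sketch, not a robust parser (it assumes well-formed input), but it is enough to poke at a record pulled down by hand before committing to a Spark job.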
Personally, I find this subject intriguing, and I suggest we get this crawl data while it's HOT! ;-)