A priori ranking of the websites a user is most likely to have visited
This is for http://cssfingerprint.com
I have a largish database (~100M rows) of websites. This includes main domains (both 2LD and 3LD) and particular URLs scraped from those domains (whether hosted there [like most blogs] or only linked from there [like Digg]), each with a reference to its host domain.
I also scrape the Alexa top million, Bloglines top 1000, Google PageRank, Technorati top 100, and Quantcast top million rankings. Many domains have no ranking at all, though, or only a partial set, and nearly all sub-domain URLs have no ranking other than Google's 0-10 PageRank (some don't even have that).
I can add any new scrapings necessary, assuming it doesn't require a massive amount of spidering.
I also have a fair amount of information about what sites previous users have visited.
What I need is an algorithm that orders these URLs by how likely a visitor is to have visited each one, without using any knowledge of the current visitor. (It can, however, use aggregated information about previous users.)
This question is just about the relatively fixed (or at least aggregated) a priori ranking; there's another question that deals with getting a dynamic ranking.
Given that I have limited resources (both computational and financial), what's the best way for me to rank these sites in order of a priori probability of their having been visited?
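To make the shape of the problem concrete, here is a minimal sketch of a naive baseline, not a proposed solution. It shrinks each URL's observed visit rate among previous users toward a rough guess derived from whatever external rankings exist, so unranked or rarely tested URLs still get a score. Everything in it is an assumption for illustration: the `Site` fields, the log-rank heuristic, and the constants `PRIOR_WEIGHT` and `UNRANKED_GUESS` are hypothetical and uncalibrated.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Optional

PRIOR_WEIGHT = 20.0    # assumed: pseudo-observations given to the external-rank guess
UNRANKED_GUESS = 0.01  # assumed: fallback when a URL has no ranking at all


@dataclass
class Site:
    url: str
    ranks: Dict[str, int] = field(default_factory=dict)  # source -> 1-based rank; absent if unranked
    pagerank: Optional[int] = None                        # 0-10, None if unknown
    visited: int = 0                                      # previous users seen to have visited
    tested: int = 0                                       # previous users tested against this URL


def external_guess(site: Site) -> float:
    """Heuristic, uncalibrated guess in [0, 1] from whatever rankings exist."""
    # Rank 1 maps to 1.0, rank 10 to 0.5, rank 1,000,000 to about 0.14.
    scores = [1.0 / (1.0 + math.log10(r)) for r in site.ranks.values() if r >= 1]
    if site.pagerank is not None:
        scores.append(site.pagerank / 10.0)
    return sum(scores) / len(scores) if scores else UNRANKED_GUESS


def prior_score(site: Site) -> float:
    """Shrink the observed visit rate toward the external guess (additive smoothing)."""
    guess = external_guess(site)
    return (site.visited + PRIOR_WEIGHT * guess) / (site.tested + PRIOR_WEIGHT)


def rank_sites(sites: List[Site]) -> List[Site]:
    """Order sites from most to least likely to have been visited."""
    return sorted(sites, key=prior_score, reverse=True)


if __name__ == "__main__":
    sites = [
        Site("c.example"),
        Site("b.example", pagerank=3, visited=90, tested=100),
        Site("a.example", ranks={"alexa": 1}),
    ]
    for s in rank_sites(sites):
        print(f"{s.url}\t{prior_score(s):.3f}")
```

With few observations the external guess dominates; as `tested` grows, the observed rate takes over. What I am unsure about is whether this kind of blend is sensible, and how to weight the different ranking sources when most of them are missing.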