用于计算大基数的 LogLog 和 HyperLogLog 算法
在哪里可以找到 LogLog 算法 的有效实现?我尝试自己实现它,但我的草案实现产生了奇怪的结果。
这里是:
function LogLog(max_error, max_count)
{
function log2(x)
{
return Math.log(x) / Math.LN2;
}
var m = 1.30 / max_error;
var k = Math.ceil(log2(m * m));
m = Math.pow(2, k);
var k_comp = 32 - k;
var l = log2(log2(max_count / m));
if (isNaN(l)) l = 1; else l = Math.ceil(l);
var l_mask = ((1 << l) - 1) >>> 0;
var M = [];
for (var i = 0; i < m; ++i) M[i] = 0;
function count(hash)
{
if (hash !== undefined)
{
var j = hash >>> k_comp;
var rank = 0;
for (var i = 0; i < k_comp; ++i)
{
if ((hash >>> i) & 1)
{
rank = i + 1;
break;
}
}
M[j] = Math.max(M[j], rank & l_mask);
}
else
{
var c = 0;
for (var i = 0; i < m; ++i) c += M[i];
return 0.79402 * m * Math.pow(2, c / m);
}
}
return {count: count};
}
function fnv1a(text)
{
var hash = 2166136261;
for (var i = 0; i < text.length; ++i)
{
hash ^= text.charCodeAt(i);
hash += (hash << 1) + (hash << 4) + (hash << 7) +
(hash << 8) + (hash << 24);
}
return hash >>> 0;
}
var words = ['aardvark', 'abyssinian', ... ,'zoology']; // about 2 300 words
var log_log = LogLog(0.01, 100000);
for (var i = 0; i < words.length; ++i) log_log.count(fnv1a(words[i]));
alert(log_log.count());
由于未知原因,实现对 max_error
参数非常敏感,它是决定结果大小的主要因素。我确信,有一些愚蠢的错误:)
更新:这个问题在较新版本的算法。我稍后会发布它的实现。
Where can I find a valid implementation of LogLog algorithm? Have tried to implement it by myself but my draft implementation yields strange results.
Here it is:
function LogLog(max_error, max_count)
{
function log2(x)
{
return Math.log(x) / Math.LN2;
}
var m = 1.30 / max_error;
var k = Math.ceil(log2(m * m));
m = Math.pow(2, k);
var k_comp = 32 - k;
var l = log2(log2(max_count / m));
if (isNaN(l)) l = 1; else l = Math.ceil(l);
var l_mask = ((1 << l) - 1) >>> 0;
var M = [];
for (var i = 0; i < m; ++i) M[i] = 0;
function count(hash)
{
if (hash !== undefined)
{
var j = hash >>> k_comp;
var rank = 0;
for (var i = 0; i < k_comp; ++i)
{
if ((hash >>> i) & 1)
{
rank = i + 1;
break;
}
}
M[j] = Math.max(M[j], rank & l_mask);
}
else
{
var c = 0;
for (var i = 0; i < m; ++i) c += M[i];
return 0.79402 * m * Math.pow(2, c / m);
}
}
return {count: count};
}
function fnv1a(text)
{
var hash = 2166136261;
for (var i = 0; i < text.length; ++i)
{
hash ^= text.charCodeAt(i);
hash += (hash << 1) + (hash << 4) + (hash << 7) +
(hash << 8) + (hash << 24);
}
return hash >>> 0;
}
var words = ['aardvark', 'abyssinian', ... ,'zoology']; // about 2 300 words
var log_log = LogLog(0.01, 100000);
for (var i = 0; i < words.length; ++i) log_log.count(fnv1a(words[i]));
alert(log_log.count());
For unknown reason implementation is very sensitive to max_error
parameter, it is the main factor that determines the magnitude of the result. I'm sure, there is some stupid mistake :)
UPDATE: This problem is solved in the newer version of algorithm. I will post its implementation later.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(6)
这里是基于较新的论文:
Here it is the updated version of the algorithm based on the newer paper:
这是一个稍微修改的版本,添加了合并操作。
Merge 允许您从 HyperLogLog 的多个实例中获取计数器,
并确定总体上唯一的计数器。
例如,如果您在周一、周二和周三收集了唯一身份访客,
然后您可以将存储桶合并在一起并计算唯一访问者的数量
在三天的时间里:
然后你可以这样做:
Here is a slightly modified version which adds the merge operation.
Merge allows you to take the counters from several instances of HyperLogLog,
and determine the unique counters overall.
For example, if you have unique visitors collected on Monday, Tuesday and Wednesday,
then you can merge the buckets together and count the number of unique visitors
over the three day span:
Then you can do something like this:
我们开源了一个名为 Stream-Lib 的项目,它有一个 LogLog 实现。这项工作基于本文。
We've open sourced a project called Stream-Lib that has a LogLog implementation. The work was based on this paper.
使用 @actual 提供的 js 版本,我尝试在 C# 中实现相同的功能,这看起来足够接近。只是稍微更改了 fnv1a 函数并将其重命名为 getHashCode。 (归功于 Jenkins 哈希函数,http://en.wikipedia.org/wiki/Jenkins_hash_function )
Using the js version @actual provided, I tried to implement the same in C#, which seems close enough. Just changed fnv1a function a little bit and renamed it to getHashCode. (Credit goes to Jenkins hash function, http://en.wikipedia.org/wiki/Jenkins_hash_function)
我知道这是一篇旧帖子,但 @buryat 实现已经移动,并且无论如何都不完整,而且有点慢(抱歉 o_o )。
我采用了新 Redis 版本使用的实现,可以在 此处找到 并将其移植到 PHP。仓库在这里 https://github.com/joegreen0991/HyperLogLog
I know this is an old post but the @buryat implementation has moved, and is in any case incomplete, and a bit on the slow side (sorry o_o ).
I've taken the implementation used by the new Redis release which can be found here and ported it to PHP. The repo is here https://github.com/joegreen0991/HyperLogLog
我在 JS 和 PHP 中实现了 loglog 和 hyperloglog 以及注释良好的代码 https://github.com/buryat/loglog
I implemented loglog and hyperloglog in JS and PHP and well-commented code https://github.com/buryat/loglog