从Google App Engine中的巨大列表中计算独特元素

发布于 2024-10-30 12:50:40 字数 388 浏览 1 评论 0原文

我有一个每月点击量为 15,000,000 次的网络小部件，并且我记录了每个会话。当我想生成报告时，我想知道有多少个唯一 IP。在普通的 SQL 中，这很容易，因为我只需执行以下操作：

SELECT COUNT(*) FROM (SELECT DISTINCT IP FROM SESSIONS)

但由于应用程序引擎不可能做到这一点，因此我现在正在研究如何做到这一点的解决方案。它不需要很快。

我想到的一个解决方案是有一个空的 Unique-IP 表，然后有一个 MapReduce 作业来遍历所有会话实体，如果实体的 IP 不在表中，我将添加它并向计数器添加一个。然后我会有另一个 MapReduce 作业来清理表格。这会疯吗？如果是这样，你会怎么做？

谢谢！

原文

I got a web widget with 15,000,000 hits/months and I log every session. When I want to generate a report I'd like to know how many unique IP there are. In normal SQL that would be easy as I'd just do a:

SELECT COUNT(*) FROM (SELECT DISTINCT IP FROM SESSIONS)

But as that's not possible with the app engine, I'm now looking into solutions on how to do it. It doesn't need to be fast.

A solution I was thinking of was to have an empty Unique-IP table, then have a MapReduce job to go through all session entities, if the entity's IP is not in the table I'll add it and add one to a counter. Then I'd have another MapReduce job that would clear the table. Would this be crazy? If so, how would you do it?

Thanks!

分享到QQ

分享到微博