Counting unique elements in a huge list on Google App Engine
I have a web widget that gets 15,000,000 hits per month, and I log every session. When I generate a report, I'd like to know how many unique IPs there are. In normal SQL that would be easy, as I'd just do:
SELECT COUNT(*) FROM (SELECT DISTINCT IP FROM SESSIONS)
But since that's not possible on App Engine, I'm now looking into other ways to do it. It doesn't need to be fast.
One solution I was considering is to start with an empty Unique-IP table, then run a MapReduce job over all session entities: if an entity's IP is not in the table, I add it and increment a counter. Afterwards, a second MapReduce job would clear the table. Would this be crazy? If so, how would you do it?
Thanks!
Comments (2)
The mapreduce approach you suggest is exactly what you want. Don't forget to use transactions to update the record in your task queue task; that is what allows you to run it in parallel with many mappers.
In the future, reduce support will make this possible with a single straightforward mapreduce, with no hacking around with your own transactions and models.
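To make the role of the transaction concrete, here is a minimal sketch in plain Python. It does not use the App Engine APIs: `FakeDatastore`, `record_ip`, and `run_mappers` are hypothetical names, the lock only stands in for a datastore transaction, and threads stand in for parallel mappers. The point is that "check whether the IP exists, insert it, bump the counter" has to happen as one atomic step.

```python
import threading


class FakeDatastore:
    """In-memory stand-in for the Unique-IP table and the counter.

    Assumption: the lock emulates a datastore transaction. On App Engine
    you would use a real transaction instead.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self.unique_ips = set()
        self.counter = 0

    def record_ip(self, ip):
        """Atomically insert the IP if it is new; return True if it was new."""
        with self._lock:
            if ip in self.unique_ips:
                return False
            self.unique_ips.add(ip)
            self.counter += 1
            return True


def run_mappers(store, sessions, num_mappers=4):
    """Split the sessions into shards and process each shard in its own thread."""
    shards = [sessions[i::num_mappers] for i in range(num_mappers)]

    def mapper(shard):
        for session in shard:
            store.record_ip(session["ip"])

    threads = [threading.Thread(target=mapper, args=(shard,)) for shard in shards]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return store.counter


# 1000 made-up sessions spread over 50 distinct IPs.
sessions = [{"ip": "10.0.0.%d" % (i % 50)} for i in range(1000)]
store = FakeDatastore()
unique_count = run_mappers(store, sessions)
print(unique_count)  # 50
```

Without the atomic check-and-insert, two mappers that see the same new IP at the same moment could both increment the counter, and the total would come out too high.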
If time is not important, you could try the task queue with a task limit of 1. Basically, you'd use a recursive task that queries through batches of log records until it hits DeadlineExceededError. Then you'd write the results to the datastore, and the task would re-enqueue itself with the query's end cursor (or the last record's key), so that the next run resumes the fetch where the previous one stopped.
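The resume-from-cursor loop described above can be sketched in plain Python. This is a simulation under stated assumptions, not App Engine code: `DeadlineExceeded`, `fetch_batch`, `run_task`, and `count_unique_ips` are hypothetical names, the cursor is just a list index, the deadline is faked with a per-run batch budget, and a Python list plays the part of the task queue.

```python
class DeadlineExceeded(Exception):
    """Stand-in for the runtime's DeadlineExceededError (assumption)."""


def fetch_batch(records, cursor, batch_size):
    """Return the next batch and the cursor that points just past it."""
    batch = records[cursor:cursor + batch_size]
    return batch, cursor + len(batch)


def run_task(records, state, cursor, batch_size=100, budget=3):
    """Process batches until done or until the simulated deadline hits.

    Returns the cursor to resume from, or None when all records are processed.
    """
    batches_done = 0
    try:
        while True:
            if batches_done >= budget:
                # Simulated deadline: only `budget` batches fit in one run.
                raise DeadlineExceeded()
            batch, next_cursor = fetch_batch(records, cursor, batch_size)
            if not batch:
                return None
            # "Write the results": here the state dict plays the datastore.
            state["unique_ips"].update(record["ip"] for record in batch)
            cursor = next_cursor
            batches_done += 1
    except DeadlineExceeded:
        return cursor


def count_unique_ips(records):
    """Drive the tasks one at a time, as a queue with a task limit of 1 would."""
    state = {"unique_ips": set(), "runs": 0}
    queue = [0]  # each entry is the start cursor of a pending task
    while queue:
        cursor = queue.pop(0)
        state["runs"] += 1
        next_cursor = run_task(records, state, cursor)
        if next_cursor is not None:
            queue.append(next_cursor)  # the task re-enqueues itself
    return len(state["unique_ips"]), state["runs"]


# 1000 made-up log records covering 77 distinct IPs.
records = [{"ip": "192.168.%d.%d" % (i % 7, i % 11)} for i in range(1000)]
unique, runs = count_unique_ips(records)
print(unique, runs)  # 77 4
```

Because only one task runs at a time, no transaction is needed around the shared state, at the cost of processing everything sequentially. In a real deployment the set of seen IPs would live in the datastore rather than in memory.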