Getting distinct users on Google App Engine

Posted 2024-08-19 18:40:32


How to do this on Google App Engine (Python):

SELECT COUNT(DISTINCT user) FROM event WHERE event_type = "PAGEVIEW" 
AND t >= start_time AND t <= end_time

Long version:

I have a Python Google App Engine application with users that generate events, such as pageviews. I would like to know in a given timespan how many unique users generated a pageview event. The timespan I am most interested in is one week, and there are about a million such events in a given week. I want to run this in a cron job.

My event entities look like this:

class Event(db.Model):
    t = db.DateTimeProperty(auto_now_add=True)
    user = db.StringProperty(required=True)
    event_type = db.StringProperty(required=True)

With an SQL database, I would do something like

SELECT COUNT(DISTINCT user) FROM event WHERE event_type = "PAGEVIEW" 
AND t >= start_time AND t <= end_time

The first thought that occurs is to get all PAGEVIEW events and filter out duplicate users. Something like:

query = Event.all()
query.filter("event_type =", "PAGEVIEW")
query.filter("t >=", start_time)
query.filter("t <=", end_time)
usernames = []
for event in query:
    usernames.append(event.user)
answer = len(set(usernames))

But this won't work, because a datastore query returns at most 1000 results. The next thing that occurs to me is to fetch 1000 events, then when those run out fetch the next thousand, and so on. But that won't work either, because issuing a thousand queries and retrieving a million entities would take well over 30 seconds, which is the request time limit.
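The batch-by-batch idea can be sketched with plain Python structures. This is an illustrative sketch only: a list stands in for the datastore, and `count_distinct_users` is a hypothetical helper, not a GAE API; real code would page with query cursors and would still hit the 30-second limit in a single request.

```python
def count_distinct_users(events, batch_size=1000):
    """Page through events batch_size at a time, collecting unique users.

    A plain list stands in for the datastore; each slice plays the role
    of one 1000-entity fetch.
    """
    seen = set()
    offset = 0
    while True:
        batch = events[offset:offset + batch_size]  # one simulated fetch
        if not batch:
            break
        seen.update(batch)
        offset += batch_size
    return len(seen)

# e.g. ~5000 pageview events generated by only 3 distinct users
events = ["alice", "bob", "carol"] * 1667
print(count_distinct_users(events))  # 3
```

The per-batch work is trivial; the problem, as noted above, is purely that a million fetches cannot fit in one request.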

Then I thought I should ORDER BY user to skip over duplicates faster. But that is not allowed, because I am already using the inequality filter "t >= start_time AND t <= end_time".

It seems clear this cannot be accomplished in under 30 seconds, so it needs to be fragmented. But finding distinct items doesn't seem to split well into subtasks. The best I can think of is, on every cron job call, to find 1000 pageview events, extract the distinct usernames from those, and put them in an entity like Chard. It could look something like

class Chard(db.Model):
    usernames = db.StringListProperty(required=True)

So each chard would have up to 1000 usernames in it, fewer if there were duplicates that got removed. After about 16 hours (which is fine) I would have all the chards and could do something like:

chards = Chard.all()
all_usernames = set()
for chard in chards:
    all_usernames = all_usernames.union(chard.usernames)
answer = len(all_usernames)

It seems like it might work, but it's hardly a beautiful solution. And with enough unique users this loop might take too long. I haven't tested it, in the hope that someone will come up with a better suggestion, so I don't know whether this loop would turn out to be fast enough.
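The whole chard pipeline can be sketched end-to-end with plain Python lists standing in for the datastore and for cron scheduling. `build_chards` and `count_distinct` are hypothetical names for illustration, not GAE APIs:

```python
def build_chards(events, batch_size=1000):
    """Each simulated 'cron call' reduces one batch of <= batch_size
    events to the sorted list of distinct usernames it contains."""
    return [sorted(set(events[i:i + batch_size]))
            for i in range(0, len(events), batch_size)]

def count_distinct(chards):
    """The final pass: union every chard's username list and count."""
    all_usernames = set()
    for chard in chards:
        all_usernames = all_usernames.union(chard)
    return len(all_usernames)

events = ["alice", "bob", "alice", "carol"] * 600  # 2400 pageview events
chards = build_chards(events)   # 3 chards, each from <= 1000 events
print(count_distinct(chards))   # 3
```

The final union is cheap as long as the number of distinct usernames fits in memory; it is the chard-building phase that has to be spread over many cron calls.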

Is there any prettier solution to my problem?

Of course all of this unique user counting could be accomplished easily with Google Analytics, but I am constructing a dashboard of application specific metrics, and intend this to be the first of many stats.


Comments (4)

等往事风中吹 2024-08-26 18:40:32


As of SDK v1.7.4, there is now experimental support for the DISTINCT function.

See: https://developers.google.com/appengine/docs/python/datastore/gqlreference
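A hedged sketch of what such a query could look like in GQL, assuming `user` is an indexed property (DISTINCT applies to projected properties, so `SELECT DISTINCT *` would not work):

```sql
SELECT DISTINCT user FROM Event
WHERE event_type = 'PAGEVIEW'
  AND t >= :1 AND t <= :2
```

Since the support is marked experimental in that SDK, behavior and limits should be checked against the linked reference.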

你怎么敢 2024-08-26 18:40:32


Here is a possibly-workable solution. It relies to an extent on using memcache, so there is always the possibility that your data would get evicted in an unpredictable fashion. Caveat emptor.

You would have a memcache variable called unique_visits_today or something similar. Every time that a user had their first pageview of the day, you would use the .incr() function to increment that counter.

Determining that this is the user's first visit of the day is done by looking at a last_activity_day field attached to the user. When the user visits, you look at that field, and if it is not today, you update it to today and increment your memcache counter.

At midnight each day, a cron job would take the current value in the memcache counter and write it to the datastore while setting the counter to zero. You would have a model like this:

class UniqueVisitsRecord(db.Model):
    # be careful setting the date correctly if processing at midnight
    activity_date = db.DateProperty()
    event_count = db.IntegerProperty()

You could then simply, easily, quickly fetch all of the UniqueVisitsRecord entities that match any date range and add up the numbers in their event_count fields.
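The counting logic above can be sketched with a plain dict standing in for memcache and another for the per-user last_activity_day field (all names here are hypothetical, and real memcache `.incr()` would be used instead of the dict update):

```python
import datetime

memcache = {"unique_visits_today": 0}  # stands in for the memcache counter
last_activity_day = {}                 # user -> date of last recorded visit

def record_pageview(user, today):
    """Increment the counter only on a user's first pageview of the day."""
    if last_activity_day.get(user) != today:
        last_activity_day[user] = today
        memcache["unique_visits_today"] += 1  # the .incr() step

def midnight_rollover():
    """The cron job: read the counter, reset it, return the day's count
    (which would be persisted in a UniqueVisitsRecord)."""
    count = memcache["unique_visits_today"]
    memcache["unique_visits_today"] = 0
    return count

today = datetime.date(2009, 8, 24)
for u in ["alice", "bob", "alice", "carol", "bob"]:
    record_pageview(u, today)
print(midnight_rollover())  # 3
```

Note the caveat from above still applies: if memcache evicts the counter mid-day, that day's count is lost or undercounted.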

伏妖词 2024-08-26 18:40:32


NDB still does not support DISTINCT. I have written a small utility method to be able to use distinct with GAE.

See here. http://verysimplescripts.blogspot.jp/2013/01/getting-distinct-properties-with-ndb.html

浊酒尽余欢 2024-08-26 18:40:32


Google App Engine, and more particularly GQL, does not support a DISTINCT function.

But you can use Python's set function as described in this blog and in this SO question.
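A minimal self-contained illustration of that set-based approach (a namedtuple stands in for the fetched Event entities):

```python
from collections import namedtuple

# Stand-in for fetched datastore entities (illustrative only)
Event = namedtuple("Event", ["user", "event_type"])

events = [Event("alice", "PAGEVIEW"),
          Event("bob", "PAGEVIEW"),
          Event("alice", "PAGEVIEW")]

# De-duplicate client-side after fetching, using a set comprehension
distinct_users = len({e.user for e in events})
print(distinct_users)  # 2
```

As the question itself notes, this only works within the result-count and request-time limits of a single fetch.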
