GAE datastore - best practices when writes outnumber reads

Posted 2024-10-31 10:54:56 · 2822 characters · 3 views · 0 comments

I'm practicing with the GAE datastore to get a feel for its query and billing mechanisms.

I've read the O'Reilly book about GAE and watched the Google videos about the datastore. My problem is that the recommended best practices usually assume more reads than writes to the datastore.

I built a super simple app:

  • there are two webpages - one to choose links, and one to view chosen links
  • every user can choose to add url links to his "links feed"
  • the user can choose as many links as he wants, whenever he wants.
  • on a different webpage, I want to show the user the most recent 10 links he chose.
  • every user has his own "links feed" webpage.
  • on every "link" I want to save and show some metadata - for example: the url link itself; when it was chosen; how many times it appeared on the feed already; etc.

In this case, since the user can choose as many links as he wants, whenever he wants, my app writes to the datastore much more often than it reads (a write happens whenever the user chooses another link; a read happens when the user opens the webpage to see his "links feed").

Question 1:
I can think of (at least) two options for handling the data for this app:

Option A:
- maintain an entity per user with the user details, registration, etc.
- maintain another entity per user that holds his 10 most recently chosen links, which will be rendered to the user's webpage when he asks for it

Option B:
- maintain an entity per url link - which means all the urls of all users will be stored as the same kind of object
- maintain an entity per user with the user details (same as in Option A), but add a reference to the user's urls in the big table of urls

Which is the better method?

Question 2:
If I want to count the total number of urls chosen to date, or the daily number of urls a user chose, or any other count - should I compute it with my SDK tools, or should I add counters to the entities I described above? (I want to reduce the number of datastore writes as much as I can.)

EDIT (to answer @Elad's comment):
Assume I want to save only the last 10 urls per user. I want to get rid of the rest (so as not to fill my DB with unnecessary data).

EDIT 2: after adding the code
So I tried it with the following code (trying Elad's method first):

Here's my class:

from google.appengine.ext import db

class UserChannel(db.Model):
    currentUser = db.UserProperty()
    userCount = db.IntegerProperty(default=0)
    # holds the last 20-30 urls; named currentPlaylist to match the handler below
    currentPlaylist = db.StringListProperty()

Then I serialized the url and metadata into JSON strings, which the user POSTs from the first page.
Here's how the POST is handled:

# handler method; the module needs: from google.appengine.api import users
def post(self):
    user = users.get_current_user()
    if user:
        # logging messages for debugging
        self.response.headers['Content-Type'] = 'text/html'
        #self.response.out.write('<p>the user_id is: %s</p>' % user.user_id())
        # updating the new item that user adds
        # (get_by_key_name returns None if the entity does not exist yet)
        current_user = UserChannel.get_by_key_name(user.nickname())
        dataJson = self.request.get('dataJson')
        #self.response.out.write('<p>the dataJson is: %s</p>' % dataJson)
        current_user.currentPlaylist.append(dataJson)
        sizePlaylist = len(current_user.currentPlaylist)
        self.response.out.write('<p>size of currentplaylist is: %s</p>' % sizePlaylist)
        # whenever the list grows past 30, cut it back to the 20 newest entries
        if sizePlaylist > 30:
            # popping indices 0..8 in a loop skips items as the list shifts
            # and removes only 9 of them, so slice off the oldest entries instead
            del current_user.currentPlaylist[:-20]
        current_user.userCount += 1
        current_user.put()
        Updater().send_update(dataJson)
    else:
        self.response.headers['Content-Type'] = 'text/html'
        self.response.out.write('user_not_logged_in')

where Updater is my class for pushing the update to the feed webpage through the Channel API.

Now, it all works: I can see each user has a ListProperty with 20-30 links (when it passes 30, I cut it down to 20), but the prices are quite high...
Each POST like the one here takes ~200ms, 121 cpu_ms, cpm_usd= 0.003588. This is very expensive considering all I do is save a string to the list...
I think the problem might be that the entity gets big because of the large ListProperty?

Comments (2)

攒眉千度 2024-11-07 10:54:56

First, you're right to worry about lots of writes to the GAE datastore - my own experience is that they're very expensive compared to reads. For instance, an app of mine that did nothing but insert records in a single model table exhausted the free quota with a few tens of thousands of writes per day. So handling writes efficiently translates directly into your bottom line.

First Question

I wouldn't store links as separate entities. The datastore is not an RDBMS, so standard normalization practices do not necessarily apply. For each User entity, use a ListProperty to store the most recent URLs along with their metadata (you can serialize everything into a string).

  • This is efficient for writing since you only update a single record - there are no updates to separate link records whenever the user adds links. Keep in mind that to keep a rolling list (FIFO) with the referenced URLs stored as separate models, every new URL means two write actions - an insert of the new URL, and a delete to remove the oldest one.
  • It's also efficient for reading since a single read on the user record gives you all the data you need to render the User's feed.
  • From a storage perspective, the total number of URLs in the world far exceeds your number of users (even if you become the next Facebook), and so does the variety of URLs chosen by your users, so the typical URL will likely have a single user - there is no real gain in RDBMS-style normalization of the data.
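The rolling list described in the bullets above can be sketched in plain Python. This is only an illustration, not App Engine code: the list `feed` stands in for the ListProperty on the user entity, and the names `add_link`, `render_feed`, `MAX_LINKS` and the metadata fields are assumptions made for the example.

```python
import json

MAX_LINKS = 10  # assumed feed length, taken from the question


def add_link(feed, url, chosen_at, max_links=MAX_LINKS):
    """Append one serialized link and drop the oldest entries (FIFO)."""
    # Everything about the link is packed into a single string,
    # so the whole feed lives in one list on one user record.
    entry = json.dumps({"url": url, "chosen_at": chosen_at, "shown": 0})
    feed.append(entry)
    # Keep only the newest max_links entries; a no-op while the list is short.
    del feed[:-max_links]
    return feed


def render_feed(feed):
    """Decode the stored strings, newest link first."""
    return [json.loads(entry) for entry in reversed(feed)]
```

In the real app the user entity would be loaded once, changed this way in memory, and saved once, so adding a link stays a single entity write no matter how much metadata each link carries.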

Another optimization idea: if your users usually add several links in a short period, you can try to write them in bulk rather than one at a time. Use memcache to store newly added user URLs, and the Task Queue to periodically write that transient data to the persistent datastore. I'm not sure what the resource cost of using Tasks is though - you'll have to check.
Here's a good article to read on the subject.
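A rough sketch of the batching idea, again in plain Python rather than the App Engine APIs: the `pending` dict plays the role of memcache, the `store` dict plays the role of the datastore, and `flush` is what a periodic task would run. The class name `WriteBuffer` and all of its members are hypothetical.

```python
from collections import defaultdict


class WriteBuffer:
    """Collects new links per user and writes them out in batches."""

    def __init__(self, store):
        self.store = store                 # dict standing in for the datastore
        self.pending = defaultdict(list)   # dict standing in for memcache
        self.writes = 0                    # counts simulated datastore writes

    def add(self, user_id, link):
        # Cheap in-memory append; nothing is persisted yet.
        self.pending[user_id].append(link)

    def flush(self, max_links=10):
        # One simulated datastore write per user, however many links were added.
        for user_id, links in self.pending.items():
            feed = self.store.get(user_id, []) + links
            self.store[user_id] = feed[-max_links:]
            self.writes += 1
        self.pending.clear()
```

The trade-off is durability: memcache entries can be evicted at any time, so links that have not been flushed yet may be lost.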

Second Question

Use counters. Just keep in mind that they aren't trivial in a distributed environment, so read up - there are many GAE articles, recipes and blog posts on the subject - just google "appengine counters". Here too, using memcache should be a good option for reducing the total number of datastore writes.
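One commonly described way to make counters work under concurrent writes is a sharded counter: increments are spread over several shards so that no single record becomes a bottleneck, and the total is the sum of the shards. The sketch below shows only the idea in plain Python; in a real app each shard would be its own entity updated in a transaction, and `ShardedCounter` and `NUM_SHARDS` are names made up for this example.

```python
import random

NUM_SHARDS = 20  # assumed shard count; tune to the expected write rate


class ShardedCounter:
    """Spreads increments over several shards; the total is their sum."""

    def __init__(self, num_shards=NUM_SHARDS):
        self.shards = [0] * num_shards

    def increment(self, amount=1):
        # Pick a shard at random so concurrent writers rarely collide.
        index = random.randrange(len(self.shards))
        self.shards[index] += amount

    def total(self):
        # Reading costs one fetch per shard, so it is slower than writing.
        return sum(self.shards)
```

Note that sharding trades cheaper writes for more expensive reads, which suits a write-heavy app like the one in the question.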

最终幸福 2024-11-07 10:54:56

Answer 1

Store links as separate entities. Also store an entity per user with a ListProperty holding the keys of the most recent 20 links. As the user chooses more links, you just update the ListProperty of keys. A ListProperty maintains order, so you don't need to worry about the chronological order of the chosen links as long as you follow a FIFO insertion order.

When you want to show the user's chosen links (page 2) you can do one get(keys) to fetch all the user's links in one call.
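As a plain-Python illustration of this layout (not datastore code): the `links` dict stands in for the separately stored link entities, `user_keys` for the per-user list of keys, and `get_multi` for the batch get. All names here are assumptions made for the sketch.

```python
def add_link(links, user_keys, key, link, max_keys=20):
    """Store the link under its key and record the key in the user's list."""
    links[key] = link
    user_keys.append(key)
    # Keys that fall off the front of the FIFO list are returned so the
    # caller can decide whether to delete the link entities behind them.
    evicted = user_keys[:-max_keys]
    del user_keys[:-max_keys]
    return evicted


def get_multi(links, keys):
    """Fetch all links for the given keys in one call, preserving order."""
    return [links[key] for key in keys if key in links]
```

Rendering the feed is then one read of the user's key list followed by one batch fetch, while adding a link touches two records: the new link and the key list.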

Answer 2

Definitely keep counters. As the number of entities grows, counting records becomes more and more expensive, but with counters the performance stays the same.
