Denormalization in Google App Engine?
Background:::
I'm working with Google App Engine (GAE) for Java. I'm struggling to design a data model that plays to Bigtable's strengths and weaknesses; these are two previous related posts:
I've tentatively decided on a fully normalized backbone with denormalized properties added into entities so that most client requests can be serviced with only one query.
I reason that a fully normalized backbone will:
- Help maintain data integrity if I code a mistake in the denormalization
- Enable writes in one operation from a client's perspective
- Allow for any type of unanticipated query on the data (provided one is willing to wait)
While the denormalized data will:
- Enable most client requests to be serviced very fast
Basic denormalization technique:::
I watched an App Engine video describing a technique referred to as "fan-out." The idea is to make quick writes to normalized data and then use the task queue to finish up the denormalization behind the scenes without the client having to wait. I've included the video here for reference, but it's an hour long and there's no need to watch it in order to understand this question:
http://code.google.com/events/io/2010/sessions/high-throughput-data-pipelines-appengine.html
If I use this "fan-out" technique, every time the client modifies some data, the application would update the normalized model in one quick write and then fire off the denormalization instructions to the task queue so the client does not have to wait for them to complete as well.
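The write path above can be sketched as follows. This is a minimal, hedged illustration, not the App Engine API itself: a single-thread `ExecutorService` stands in for GAE's task queue, in-memory maps stand in for the datastore, and all entity and method names are made up for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of the "fan-out" write path: the normalized record is written
// synchronously, and the denormalization work is handed to a background
// queue so the client's request can return immediately. An ExecutorService
// stands in for App Engine's task queue; names are illustrative.
public class FanOutWrite {
    public static final Map<String, String> normalizedCustomers = new ConcurrentHashMap<>();
    public static final Map<String, String> denormalizedAppointments = new ConcurrentHashMap<>();
    public static final ExecutorService taskQueue = Executors.newSingleThreadExecutor();

    // Called in the request path: one quick normalized write, then enqueue and return.
    public static void updateCustomerName(String customerId, String newName,
                                          Iterable<String> appointmentIds) {
        normalizedCustomers.put(customerId, newName);   // fast, authoritative write
        taskQueue.submit(() -> {                        // denormalize later, off the request path
            for (String apptId : appointmentIds) {
                denormalizedAppointments.put(apptId, newName);
            }
        });
    }

    // Test helper: wait for all queued denormalization work to finish.
    public static void drain() {
        taskQueue.shutdown();
        try {
            taskQueue.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The gap between the synchronous write and the queued work is exactly the stale-read window described in the next section.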
Problem:::
The problem with using the task queue to update the denormalized version of the data is that the client could make a read request on data they just modified before the task queue has completed the denormalization of that data. This would provide the client with stale data that is incongruent with their recent request, confusing the client and making the application appear buggy.
As a remedy, I propose fanning out denormalization operations in parallel via asynchronous calls to other URLs in the application via URLFetch (http://code.google.com/appengine/docs/java/urlfetch/). The application would wait until all of the asynchronous calls had completed before responding to the client request.
For example, suppose I have an "Appointment" entity and a "Customer" entity. Each appointment would include a denormalized copy of the information for the customer it is scheduled for. If a customer changed their first name, the application would make 30 asynchronous calls, one to each affected appointment resource, to change the copy of the customer's first name in each one.
In theory, this could all be done in parallel. All of this information could be updated in roughly the time it takes to make 1 or 2 writes to the datastore. A timely response could then be made to the client after the denormalization was completed, eliminating the possibility of the client being exposed to incongruent data.
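The wait-for-everything variant can be sketched like this. Again a hedged stand-in, assuming the details above: a thread pool plays the role of the asynchronous URLFetch calls, each task represents one appointment update, and the handler blocks on `invokeAll` so it only responds once every copy is consistent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of the proposed synchronous fan-out: each affected appointment is
// updated by its own parallel task (standing in for one async URLFetch call),
// and the request handler blocks until every update has completed before
// responding, so the client can never read a half-denormalized state.
public class ParallelFanOut {
    public static List<String> denormalizeAll(String newFirstName, List<String> appointmentIds) {
        ExecutorService pool = Executors.newFixedThreadPool(10); // GAE caps async fetches at 10
        List<Callable<String>> calls = new ArrayList<>();
        for (String id : appointmentIds) {
            calls.add(() -> id + ":" + newFirstName);            // stand-in for one URLFetch
        }
        List<String> results = new ArrayList<>();
        try {
            for (Future<String> f : pool.invokeAll(calls)) {     // blocks until all finish
                results.add(f.get());
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
        return results;                                          // safe to respond to the client now
    }
}
```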
The biggest potential problem I see with this is that the application cannot have more than 10 asynchronous request calls in flight at any one time (documented here: http://code.google.com/appengine/docs/java/urlfetch/overview.html).
Proposed denormalization technique (recursive asynchronous fan-out):::
My proposed remedy is to send the denormalization instructions to another resource that recursively splits them into equal-sized smaller chunks, calling itself with the smaller chunks as parameters until each chunk is small enough to be executed outright. For example, if a customer with 30 associated appointments changed the spelling of their first name, I'd call the denormalization resource with instructions to update all 30 appointments. It would then split those instructions into 10 sets of 3 and make 10 asynchronous requests to its own URL, one per set. Once a set contained fewer than 10 instructions, the resource would make asynchronous requests outright, one per instruction.
My concerns with this approach are:
- It could be interpreted as an attempt to circumvent App Engine's rules, which would cause problems. (It's not even allowed for a URL to call itself, so I'd in fact need two URL resources that call each other to handle the recursion.)
- It is complex with multiple points of potential failure.
I'd really appreciate some input on this approach.
Answers (3)
This sounds awfully complicated, and the more complicated the design the more difficult it is to code and maintain.
Assuming you need to denormalize your data, I'd suggest just using the basic denormalization technique, but keep track of which objects are being updated. If a client requests an object which is being updated, you know you need to query the database to get the updated data; if not, you can rely on the denormalized data. Once the task queue finishes, it can remove the object from the "being updated" list, and everything can rely on the denormalized data.
A sophisticated version could even track when each object was edited, so a given object would know if it had already been updated by the task queue.
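This answer's bookkeeping can be sketched as follows. A hedged, in-memory illustration only: on App Engine the "being updated" set would itself live in the datastore or memcache rather than in process memory, and all names here are made up.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the suggested tracking scheme: keep a set of entity keys whose
// denormalized copies are still being rebuilt by the task queue. Reads check
// the set and fall back to the normalized (authoritative) data while an
// update is pending; once the task finishes, the fast path is safe again.
public class UpdateTracker {
    public static final Set<String> beingUpdated = ConcurrentHashMap.newKeySet();
    public static final Map<String, String> normalized = new ConcurrentHashMap<>();
    public static final Map<String, String> denormalized = new ConcurrentHashMap<>();

    // Request path: mark the entity before the task queue has run.
    public static void startUpdate(String key, String newValue) {
        beingUpdated.add(key);
        normalized.put(key, newValue);
    }

    // Task-queue path: refresh the denormalized copy, then clear the mark.
    public static void finishUpdate(String key) {
        denormalized.put(key, normalized.get(key));
        beingUpdated.remove(key);
    }

    // Read path: slow-but-consistent while marked, fast path otherwise.
    public static String read(String key) {
        return beingUpdated.contains(key) ? normalized.get(key) : denormalized.get(key);
    }
}
```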
It sounds like you are re-implementing Materialized Views: http://en.wikipedia.org/wiki/Materialized_view
I suggest an easy solution with Memcache. Upon an update from your client, you could save an entry in Memcache storing the key of the updated entity with the status 'updating'. When your task finishes, it deletes the memcached status. You can then check the status before a read, allowing the user to be correctly informed if the entity is still 'locked'.
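The memcache-flag variant can be sketched like this. A hedged stand-in: a `ConcurrentHashMap` plays the role of App Engine's memcache (its put/get/delete operations correspond to the map calls below), and the key scheme and method names are illustrative.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the memcache-based flag: on write, a marker keyed by the entity's
// key is stored with status "updating"; the background task deletes the marker
// when denormalization finishes. Readers check the marker first to decide
// whether the denormalized copy can be trusted.
public class MemcacheFlagSketch {
    static final Map<String, String> memcache = new ConcurrentHashMap<>(); // stand-in for memcache

    public static void markUpdating(String entityKey) {
        memcache.put("status:" + entityKey, "updating"); // set on the client's write
    }

    public static void clearStatus(String entityKey) {
        memcache.remove("status:" + entityKey);          // delete when the task finishes
    }

    public static boolean isLocked(String entityKey) {
        return "updating".equals(memcache.get("status:" + entityKey)); // checked before reads
    }
}
```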