What assumptions can I make about global time on Azure?
I want my Azure role to reprocess data in the event of a sudden failure. I am considering the following option.
For every block of data to process, I have a database table row, and I could add a column meaning "time of last ping from a processing node". When a node grabs a data block for processing, it sets the "processing" state and that time to the current time, and it is then the node's responsibility to update that time, say, every minute. Periodically, some node will then ask for "all blocks in the processing state whose ping time is older than ten minutes", consider those blocks abandoned, and somehow queue them for reprocessing.
I have one very serious concern. The above approach requires that nodes have more or less the same time. Can I rely on all Azure nodes having the same time with some reasonable precision (say several seconds)?
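To make the heartbeat idea concrete, here is a minimal sketch of it, assuming a hypothetical blocks table; it uses sqlite3 and UTC timestamps so it runs standalone, and the table, column and function names are illustrative only. Note that storing UTC from each node still leaves the scheme exposed to clock skew, which is exactly the concern; one way to sidestep it is to let the database server supply the timestamp (e.g. SQL Server's GETUTCDATE()) so that only one clock is ever consulted.

```python
# Minimal sketch of the "last ping" heartbeat idea, assuming a hypothetical
# blocks table; sqlite3 stands in for the real database so it runs standalone.
import sqlite3
from datetime import datetime, timedelta, timezone

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE blocks (id INTEGER PRIMARY KEY, state TEXT, last_ping_utc TEXT)")

def utc_now_iso() -> str:
    # Always store UTC; node-local time zones must never leak into the table.
    return datetime.now(timezone.utc).isoformat()

def claim_block(block_id: int) -> None:
    # A node marks the block as being processed and records its first heartbeat.
    db.execute("UPDATE blocks SET state = 'processing', last_ping_utc = ? WHERE id = ?",
               (utc_now_iso(), block_id))

def heartbeat(block_id: int) -> None:
    # Called, say, once a minute while the node is still working on the block.
    db.execute("UPDATE blocks SET last_ping_utc = ? WHERE id = ? AND state = 'processing'",
               (utc_now_iso(), block_id))

def abandoned_blocks(stale_after: timedelta = timedelta(minutes=10)) -> list:
    # Blocks whose last heartbeat is older than the threshold are assumed abandoned.
    cutoff = (datetime.now(timezone.utc) - stale_after).isoformat()
    rows = db.execute("SELECT id FROM blocks WHERE state = 'processing' AND last_ping_utc < ?",
                      (cutoff,)).fetchall()
    return [r[0] for r in rows]

db.execute("INSERT INTO blocks (id, state, last_ping_utc) VALUES (1, 'pending', NULL)")
claim_block(1)
print(abandoned_blocks())  # [] until the heartbeat goes stale
```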
4 Answers
For processing times under 2 hrs, you can usually rely on queue semantics (visibility timeout). If you have the data stored in blob storage, you can have a worker pop a queue message containing the name of the blob to work on and set a reasonable visibility timeout on the message (up to 2 hrs today). Once it completes the work, it can delete the queue message. If it fails, the delete is never called and after the visibility timeout, it will reappear on the queue for reprocessing. This is why you want your work to be idempotent, btw.
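A rough sketch of that pattern with the current azure-storage-queue SDK follows; the connection string, queue name and process_blob() are placeholders, not something from this answer.

```python
# Hedged sketch: consumer side of the visibility-timeout pattern.
# "<connection-string>", the queue name and process_blob() are placeholders.
from azure.storage.queue import QueueClient

queue = QueueClient.from_connection_string("<connection-string>", "work-items")

def process_blob(blob_name: str) -> None:
    ...  # your own (idempotent) processing of the named blob

# While a received message is invisible, no other worker sees it; if this
# worker dies before delete_message(), the message reappears after the timeout.
for msg in queue.receive_messages(visibility_timeout=7200):  # 2 hours
    process_blob(msg.content)      # message body = name of the blob to work on
    queue.delete_message(msg)      # delete only after the work has succeeded
```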
For processing that lasts longer than two hours, I generally recommend a leasing strategy where the worker leases the underlying blob data (if possible, or a dummy blob otherwise) using the intrinsic lease functionality in Windows Azure blob storage. When a worker goes to retrieve a file, it tries to lease it. A file that is already leased is indicative of a worker role currently processing it. If a failure occurs, the lease will be broken and it will become leasable by another instance. Leases must be renewed every minute or so, but they can be held indefinitely.
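As an illustration only, here is roughly what the lease strategy looks like with the current azure-storage-blob SDK; the connection string, container and blob names and the renewal interval are placeholders, and today a lease is either 15-60 seconds or infinite, with the finite variant renewed from the worker.

```python
import threading

from azure.core.exceptions import HttpResponseError
from azure.storage.blob import BlobClient

# Hedged sketch of the lease strategy; the connection string, container
# and blob names are placeholders.
blob = BlobClient.from_connection_string("<connection-string>", "data", "block-0001.dat")

try:
    # Fixed-duration leases are 15-60 seconds today (-1 means infinite).
    lease = blob.acquire_lease(lease_duration=60)
except HttpResponseError:
    lease = None  # already leased: another worker is processing this blob

if lease is not None:
    stop = threading.Event()

    def keep_alive() -> None:
        # Renew well inside the lease duration for as long as the work runs;
        # if this process dies, the lease expires and another worker can take over.
        while not stop.wait(20):
            lease.renew()

    threading.Thread(target=keep_alive, daemon=True).start()
    try:
        pass  # ... long-running processing of the blob goes here ...
    finally:
        stop.set()
        lease.release()
```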
Of course, you are keeping the data to be processed in blob storage, right? :)
As already indicated, you should not rely on synchronized times between VM nodes. If you store datetimes for any reason - use UTC or you will be sorry later.
The answer here isn't to use time-based synchronization (if you do go that route, make sure you use UTCNow); there is still no guarantee anywhere that the clocks are synced, nor should there be.
For the problem you are describing, a queue-based system is the answer. I've referenced it a lot and will do so again: I've explained some of the benefits of queue-based systems in my blog post.
The idea is the following:
With your approach I would use AppFabric Queues, because you can also have topics & subscriptions, which allow you to monitor the data items. The example in my blog post covers this exact scenario, with the only difference being that instead of having a worker role, I poll the queue from my web application. But the concept is the same.
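AppFabric Queues have since become Azure Service Bus; purely as a hedged sketch of the same peek-lock idea (the connection string, queue name and handle() are placeholders):

```python
from azure.servicebus import ServiceBusClient

def handle(body: str) -> None:
    ...  # your own (idempotent) processing of the data item

with ServiceBusClient.from_connection_string("<connection-string>") as client:
    # The default receive mode is peek-lock: an uncompleted message becomes
    # visible again once its lock expires, so a crashed worker does not lose it.
    with client.get_queue_receiver(queue_name="data-items") as receiver:
        for msg in receiver.receive_messages(max_message_count=10, max_wait_time=5):
            try:
                handle(str(msg))
                receiver.complete_message(msg)   # done: remove from the queue
            except Exception:
                receiver.abandon_message(msg)    # give it back for another attempt
```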
I would try this a different way, using queue storage. If you put your block of data on a queue with a timeout, your processing nodes (worker roles?) can then pull the data off the queue.
After the data is popped off the queue, if the processing node does not delete the entry, it will reappear on the queue for reprocessing after the timeout period.
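For completeness, the enqueue side of that is a one-liner with azure-storage-queue (the names are placeholders; the consumer sketch shown under the first answer handles the timeout and delete part):

```python
# Hedged sketch: producer side - put the name of a block on the queue.
# "<connection-string>", the queue name and the message body are placeholders.
from azure.storage.queue import QueueClient

queue = QueueClient.from_connection_string("<connection-string>", "work-items")
queue.send_message("block-0001.dat")  # workers receive this with a visibility timeout
```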
Remote desktop into a role instance and check (a) the time zone (UTC, I think), and (b) that Internet Time is enabled in Date and Time settings. If so then you can rely on them being no more than a few ms apart. (This is not to say that the suggestions to use a message queue instead won't work, but perhaps they do not suit your needs.)