How do I detect an unexpected worker role failure and reprocess the data in such cases?

Posted on 2024-11-08 14:29:03

I want to create a web service hosted in Windows Azure. The clients will upload files for processing, the cloud will process those files, produce resulting files, the client will download them.

I guess I'll use web roles for handling HTTP requests and worker roles for actual processing and something like Azure Queue or Azure Table Storage for tracking requests. Let's pretend it'll be Azure Table Storage - one "request" record per user uploaded file.

A major design problem is processing a single file can take anywhere from one second to say ten hours.

So I expect the following case: a worker role is started, gets to Azure Table Storage, finds a request marked "ready for processing", marks it "is being processed", starts actual processing. Normally it would process the file and mark the request "processed", but what if it dies unexpectedly?

Unless I take care of it the request will remain in "is being processed" state forever.

How do I track requests that are marked "is being processed" but abandoned? What mechanism in Windows Azure would be most convenient for that?
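The lifecycle described above can be sketched as a small state machine (the names are illustrative, not an Azure API), which makes the failure mode explicit:

```python
from enum import Enum

class RequestState(Enum):
    READY = "ready for processing"
    PROCESSING = "is being processed"
    PROCESSED = "processed"

class Request:
    def __init__(self, file_id):
        self.file_id = file_id
        self.state = RequestState.READY

def claim(request):
    """A worker claims a request: READY -> PROCESSING."""
    if request.state is not RequestState.READY:
        raise ValueError("request already claimed")
    request.state = RequestState.PROCESSING

def complete(request):
    """Normal completion: PROCESSING -> PROCESSED."""
    request.state = RequestState.PROCESSED

# If the worker dies between claim() and complete(), the request is
# stuck in PROCESSING forever unless some other mechanism notices.
r = Request("file-001")
claim(r)
# ... worker crashes here, so complete(r) is never reached ...
print(r.state)  # RequestState.PROCESSING, with nothing to reset it
```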

Comments (4)

抚你发端 2024-11-15 14:29:03

The main issue you have is that queues today cannot set a visibility timeout larger than 2 hours. So, you need another mechanism to indicate that active work is in progress. I would suggest a blob lease. For every file you process, you either lease the blob itself or a 0-byte marker blob. Your workers scan the available blobs and attempt to lease them. If a worker gets the lease, the file is not being processed, and it goes ahead and processes it. If it fails to acquire the lease, another worker must already be actively working on that file.

Once the worker has completed processing the file, it simply copies the file into another container in blob storage (or deletes it if you wish) so that it is not scanned again.

Leases are really your only answer here until queue messages can be renewed.

Edit: I should clarify that the reason leases work here is that a lease must be actively maintained every 30 seconds or so, so you have a very small window in which you can tell whether a worker has died or is still working on the file.
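The lease mechanism can be illustrated with a minimal in-memory simulation (this is not the real blob storage API, and the lease duration is shortened from the real ~30-60 seconds for demonstration):

```python
import time

class LeaseStore:
    """Simulates blob leases: a lease expires unless actively renewed."""
    def __init__(self, duration=0.2):  # real blob leases last ~30-60 s
        self.duration = duration
        self.leases = {}  # blob name -> lease expiry time

    def try_acquire(self, blob):
        now = time.monotonic()
        expiry = self.leases.get(blob)
        if expiry is not None and expiry > now:
            return False  # another worker holds an unexpired lease
        self.leases[blob] = now + self.duration
        return True

    def renew(self, blob):
        """A live worker calls this periodically to keep its lease."""
        self.leases[blob] = time.monotonic() + self.duration

store = LeaseStore()
assert store.try_acquire("file-001")        # first worker wins the lease
assert not store.try_acquire("file-001")    # second worker backs off
# The first worker dies and stops renewing; once the duration elapses,
# the lease expires and another worker can pick the file up.
time.sleep(0.25)
assert store.try_acquire("file-001")
```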

云淡风轻 2024-11-15 14:29:03

I believe this problem is not technology-specific.
Since your processing jobs are long-running, I suggest these jobs report their progress during execution. That way, a job which has not reported progress for a substantial duration becomes a clear candidate for cleanup and can then be restarted on another worker role.
How you record progress and do job swapping is up to you. One approach is to use a database as the recording mechanism and create an agent worker process that pings the job-progress table. If the agent detects a problem, it can take corrective action.

Another approach would be to associate the worker role's identity with the long-running process. The worker roles can communicate their health status using some sort of heartbeat.
If the jobs were not long-running, you could record the job's start time instead of a status flag and use a timeout mechanism to determine whether processing has failed.
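The heartbeat-plus-agent idea above can be sketched as follows (an in-memory dict stands in for the job-progress table, and the timeout is shortened from realistic minutes to fractions of a second):

```python
import time

HEARTBEAT_TIMEOUT = 0.2  # real deployments would use minutes

jobs = {}  # job id -> last heartbeat timestamp (a database table in practice)

def heartbeat(job_id):
    """Called periodically by the worker processing job_id."""
    jobs[job_id] = time.monotonic()

def find_abandoned():
    """The agent pings the table and returns jobs whose worker went silent."""
    now = time.monotonic()
    return [j for j, t in jobs.items() if now - t > HEARTBEAT_TIMEOUT]

heartbeat("job-1")
heartbeat("job-2")
time.sleep(0.25)
heartbeat("job-2")            # job-2's worker is still alive
stale = find_abandoned()
print(stale)  # ['job-1'] -- a candidate for cleanup and restart
```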

嘿嘿嘿 2024-11-15 14:29:03

The problem you describe is best handled with Azure Queues, as Azure Table Storage won't give you any type of management mechanism.

Using Azure Queues, you set a timeout when you get an item off the queue (default: 30 seconds). Once you read a queue item (e.g. "process file x waiting for you in blob at url y"), that queue item becomes invisible for the time period specified. This means that other worker role instances won't try to grab it at the same time. Once you complete processing, you simply delete the queue item.

Now: let's say you're almost done and haven't deleted the queue item yet. All of a sudden, your role instance unexpectedly crashes (or the hardware fails, or you're rebooted for some reason). The queue-item processing code has now stopped. Eventually, once the timeout period you set has elapsed since the queue item was originally read, the item becomes visible again. One of your worker role instances will once again read the queue item and can process it.

A few things to keep in mind:

  • Queue items have a dequeue count. Pay attention to this. Once you hit a certain number of dequeues for a specific queue item (I like to use 3 times as my limit), you should move this queue item to a 'poison queue' or table storage for offline evaluation - there could be something wrong with the message or the process around handling that message.
  • Make sure your processing is idempotent (e.g. you can process the same message multiple times with no side-effects)
  • Because a queue item can go invisible and then return to visibility later, queue items don't necessarily get processed in FIFO order.
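A rough in-memory sketch of the visibility-timeout and poison-queue behavior described above (this is not the real Azure Queue API, and the timeout is shortened from the real 30-second default for illustration):

```python
import time

VISIBILITY_TIMEOUT = 0.2   # Azure's default is 30 seconds
MAX_DEQUEUES = 3           # the poison-message limit suggested above

class Message:
    def __init__(self, body):
        self.body = body
        self.dequeue_count = 0
        self.invisible_until = 0.0

queue, poison_queue = [], []

def get_message():
    """Return the first visible message, hiding it for the timeout period."""
    now = time.monotonic()
    for msg in list(queue):
        if msg.invisible_until <= now:
            msg.dequeue_count += 1
            if msg.dequeue_count > MAX_DEQUEUES:
                queue.remove(msg)
                poison_queue.append(msg)   # set aside for offline evaluation
                continue
            msg.invisible_until = now + VISIBILITY_TIMEOUT
            return msg
    return None

def delete_message(msg):
    queue.remove(msg)  # called only after processing succeeds

queue.append(Message("process file x in blob at url y"))
m = get_message()                   # a worker grabs the message
assert get_message() is None        # invisible: no other worker can grab it
time.sleep(0.25)                    # the worker crashes; timeout elapses
assert get_message() is m           # message reappears for another worker
```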

EDIT: Per Ryan's answer - Azure queue messages max out at a 2-hour timeout. Service Bus queue messages have a far greater timeout; that feature just went CTP a few days ago.

吻风 2024-11-15 14:29:03

Your role's OnStop() could be part of the solution, but there are some circumstances (hardware failure) where it won't get called. To cover that case, have your OnStart() mark everything carrying the same RoleInstanceID as abandoned: if any of that work were still happening, OnStart() would not be running. (Luckily, you can observe that Azure reuses its role instance IDs.)
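That recovery step can be sketched like this (the table rows, field names, and instance IDs are illustrative; in a real role, RoleEnvironment would supply the instance ID and the rows would live in Azure Table Storage):

```python
# Simulated request table, keyed by the worker instance that claimed each row.
requests = [
    {"file": "a.dat", "state": "is being processed", "instance_id": "WorkerRole_IN_0"},
    {"file": "b.dat", "state": "is being processed", "instance_id": "WorkerRole_IN_1"},
    {"file": "c.dat", "state": "processed",          "instance_id": "WorkerRole_IN_0"},
]

def on_start(my_instance_id):
    """If OnStart() is running, no earlier work claimed under this instance ID
    can still be live, so anything left 'is being processed' was abandoned."""
    for row in requests:
        if row["instance_id"] == my_instance_id and row["state"] == "is being processed":
            row["state"] = "ready for processing"  # hand it back for reprocessing

on_start("WorkerRole_IN_0")
print([r["state"] for r in requests])
# ['ready for processing', 'is being processed', 'processed']
```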
