We had a terrible problem/experience yesterday when trying to swap our staging <--> production role.
Here is our setup:
We have a worker role picking up messages from the queue. These messages are processed on the role (Table Storage inserts, db selects etc.). This can take maybe 1-3 seconds per queue message depending on how many table storage posts it needs to make. It deletes the message when everything is finished.
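Roughly, the loop looks like this (a simplified sketch against the CloudQueue client; queue is the CloudQueue and ProcessMessage stands in for the table storage / db work):

    while (true)
    {
        CloudQueueMessage msg = queue.GetMessage();  // message becomes invisible
        if (msg == null)
        {
            Thread.Sleep(1000);  // queue empty; back off briefly (illustrative)
            continue;
        }

        ProcessMessage(msg);       // 1-3 seconds of table storage inserts / db selects
        queue.DeleteMessage(msg);  // removed only after everything succeeded
    }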
Problem when swapping:
When our staging project went online, our production worker role started erroring.
When the role wanted to process a queue message it gave a constant stream of 'EntityAlreadyExists' errors. Because of these errors, queue messages weren't getting deleted. This caused the queue messages to be put back in the queue and back to processing and so on....
When looking inside these queue messages and analysing what would happen with them, we saw they were actually processed but not deleted.
The problem wasn't over when we deleted these faulty messages. New queue messages weren't processed either: they stayed unprocessed and no table storage records were added for them, which sounds very strange.
After deleting both staging and production and publishing to production again, everything started to work just fine.
Possible problem(s)?
We have little to no idea what actually happened.
- Maybe both roles picked up the same messages, and one did the post while the other errored?
- ...???
Possible solution(s)?
We have some ideas on how to solve this 'problem'.
- Make a poison message failover system? When the dequeue count gets over X we should just delete that queue message or place it into a separate 'poison queue' (see the sketch after this list).
- Catch the EntityAlreadyExists error and just delete that queue message or put it in a separate queue.
- ...????
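A rough sketch of how the first two ideas could be combined, assuming a second CloudQueue called poisonQueue (a hypothetical name) next to the main one:

    const int MaxDequeueCount = 5;  // the 'X' threshold; tune to taste

    CloudQueueMessage msg = queue.GetMessage();
    if (msg != null)
    {
        if (msg.DequeueCount > MaxDequeueCount)
        {
            // Poison message: park it for inspection instead of retrying forever.
            poisonQueue.AddMessage(new CloudQueueMessage(msg.AsString));
            queue.DeleteMessage(msg);
        }
        else
        {
            ProcessMessage(msg);       // the normal work
            queue.DeleteMessage(msg);
        }
    }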
Multiple roles
I suppose we will have the same problem when putting up multiple role instances?
Many thanks.
EDIT 24/02/2012 - Extra information
- We actually use GetMessage().
- Every item in the queue is unique and will generate unique messages in Table Storage. A little more information about the process: a user posts something and it has to be distributed to certain other users. The message generated from that user will have a unique id (guid). This message will be posted into the queue and picked up by the worker role. The message is distributed over several other tables (partitionkey -> UserId, rowkey -> some timestamp in ticks & the unique message id). So there is almost no chance the same message will be posted twice in a normal situation.
- The invisibility timeout COULD be a logical explanation, because some messages can be distributed to 10-20 tables. That means 10-20 inserts without the batch option. Can you set or extend this invisibility timeout?
- Not deleting the queue message because of an exception COULD be an explanation as well, because we haven't implemented any poison message failover YET ;).
Comments (5)
Regardless of the Staging vs. Production issue, having a mechanism that handles poison messages is critical. We've implemented an abstraction layer over Azure queues that automatically moves messages to a poison queue once processing has been attempted a configurable number of times.
You clearly have a fault in handling duplicate messages. The fact that your ID is unique doesn't mean that the message will not be processed twice on some occasions, for example when the visibility timeout expires before processing finishes, or when an instance is shut down after processing a message but before deleting it.
In all cases, you need code that handles the fact that the message will re-appear. One way is to use the DequeueCount property and check how many times the message has been retrieved from the queue for processing. Make sure you have code that handles partial processing of a message.
Now, what probably happened during the swap: when the production environment became staging and staging became production, both of them were trying to receive the same messages, so they were basically competing with each other for those messages. That by itself isn't bad, since competing consumers is a known, working pattern. But when you killed your old production (now staging), every message that had been received for processing and wasn't finished ended up back in the queue, and your new production environment picked it up for processing again. With no code logic to handle this scenario, and with messages that were partially processed (some records already existed in the tables), it started causing the behavior you noticed.
There are a few possible causes:
How are you reading the queue messages? If you are doing a Peek Message then the message will still be visible to be picked up by another role instance (or your staging environment) before the message is deleted. You want to make sure you are using Get Message so the message is invisible until it can be deleted (see the sketch below).
Is it possible that your first role crashed after doing the work for the message but prior to deleting the message? This would cause the message to become visible again and get picked up by another role instance. At that point the message will be a poison message which will cause your instances to constantly crash.
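To make the Peek vs. Get difference concrete, a sketch against the CloudQueue client:

    // Peek: the message stays visible, so another instance (or the other
    // deployment) can pick up and process the very same message.
    CloudQueueMessage peeked = queue.PeekMessage();

    // Get: the message turns invisible for the visibility timeout, so nobody
    // else sees it until the timeout expires or the message is deleted.
    CloudQueueMessage received = queue.GetMessage();
    // ... process ...
    queue.DeleteMessage(received);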
This problem almost certainly has nothing to do with Staging vs Production, but is most likely caused by having multiple instances reading from the same queue. You can probably reproduce the same problem by specifying 2 instances, or by deploying the same code to 2 different production services, or by running the code locally on your dev machine (still pointing to Azure storage) using 2 instances.
In general you do need to handle poison messages, so you will need to implement that logic anyway, but I would suggest getting to the root cause of this problem first, otherwise you are just going to run into a lot more problems later on.
With queues you need to code with idempotency in mind, and expect and handle 'EntityAlreadyExists' as a viable response.
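For example, treat a Conflict on insert as 'already done' rather than as a failure. A sketch, assuming the later Microsoft.WindowsAzure.Storage table client with table being a CloudTable (the older WCF Data Services based client surfaces the same 409 differently):

    try
    {
        table.Execute(TableOperation.Insert(entity));
    }
    catch (StorageException ex)
    {
        // 409 Conflict is the 'EntityAlreadyExists' case: a previous (partial)
        // run already wrote this row, so treat it as done and move on.
        if (ex.RequestInformation.HttpStatusCode != 409)
            throw;
    }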
As others have suggested, the causes could be any of those listed in the other answers.
Without looking at the code, I am guessing that it is either the third or fourth option that is occurring.
If you cannot detect the issue with a code review, you may consider adding time-based logging and try/catch wrappers to get a better understanding.
Using queues effectively, in a multi-role environment, requires a slightly different mindset and running into such issues early is actually a blessing in disguise.
Appended 2/24
Just to clarify, modifying the invisibility timeout is not a generic solution to this type of problem. Also note that this feature, although available in the REST API, may not be available in the queue client.
Other options involve writing to table storage in an asynchronous manner to speed up your processing time, but again, this is a stop-gap measure which does not really address the underlying paradigm of working with queues.
So, the bottom line is to be idempotent. You can try using the Table Storage upsert (update or insert) feature to avoid getting the 'EntityAlreadyExists' error, if that works for your code. If all you are doing is inserting new entities into Azure Table Storage, then the upsert should solve your problem with minimal code change.
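A sketch of the upsert, again assuming the newer table client (and a storage service version that supports upsert):

    // InsertOrReplace never throws 'EntityAlreadyExists': a replayed message
    // simply overwrites the identical row it wrote the first time around.
    table.Execute(TableOperation.InsertOrReplace(entity));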
If you are doing updates then it is a different ball game altogether. One pattern is to pair updates with dummy inserts in the same table with the same partition key, so that the pair errors out if the update occurred previously, letting you skip the update. Later, after the message is deleted, you can delete the dummy inserts. However, all this adds complexity, so it is much better to revisit the architecture of the product; for example, do you really need to insert/update into so many tables?
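A hedged sketch of that dummy-insert pattern, using an entity group transaction so the marker insert and the real update succeed or fail together (ProcessedMarker is a hypothetical entity type sharing the partition key of the entity being updated):

    var batch = new TableBatchOperation();
    batch.Insert(new ProcessedMarker(userId, messageId));  // fails on a replay
    batch.Replace(updatedEntity);                          // the real update

    try
    {
        table.ExecuteBatch(batch);
    }
    catch (StorageException ex)
    {
        // 409 from the marker insert means this update already ran once: skip it.
        if (ex.RequestInformation.HttpStatusCode != 409)
            throw;
    }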
Without knowing what your worker role is actually doing I'm taking a guess here, but it sounds like when you have two instances of your worker role running you are getting conflicts while trying to write to an Azure table. It is likely to be because you have code that looks something like this:
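    // (An illustrative guess at the shape of such code -- the classic
    // check-then-insert race; GetFooFromTableStorage / SaveFooToTableStorage
    // are hypothetical helpers.)
    Foo foo = GetFooFromTableStorage(message.FooId);
    if (foo == null)
    {
        // Two instances can both reach this point for the same FooId...
        foo = new Foo(message.FooId);
        SaveFooToTableStorage(foo);  // ...and the slower one gets 'EntityAlreadyExists'.
    }
    queue.DeleteMessage(message);    // never reached by the instance that errored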
If you have two adjacent messages in the queue with the same FooId, it is quite likely that you'll end up with both of the instances checking to see if the Foo exists, not finding it, then trying to create it. Whichever instance is the last to try and save the item will get the "Entity already exists" error. Because it errored, it never gets to the delete-message part of the code, and therefore the message becomes visible back on the queue after a period of time. As others have said, dealing with poison messages is a really good idea.
Update 27/02
If it's not subsequent messages (which, based on your partition/row key scheme, I would say is unlikely), then my next bet would be the same message appearing back in the queue after the visibility timeout. By default, if you're using .GetMessage() the timeout is 30 seconds. It has an overload which allows you to specify how long that time frame is. There is also the .UpdateMessage() function that allows you to update that timeout while you're processing the message. For example, you could set the initial visibility to 1 minute, then, if you're still processing the message 50 seconds later, extend it for another minute.
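Something like this (a sketch; UpdateMessage availability depends on the storage client library version):

    // Take the message with an initial visibility window of 1 minute.
    CloudQueueMessage msg = queue.GetMessage(TimeSpan.FromMinutes(1));

    // ... roughly 50 seconds of processing later, still not finished:
    queue.UpdateMessage(msg, TimeSpan.FromMinutes(1), MessageUpdateFields.Visibility);

    // Done: delete it for good.
    queue.DeleteMessage(msg);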