Debugging an intermittently stuck NSOperationQueue
I have an iOS app with a really nasty bug: an operation in my NSOperationQueue will, for some reason, hang and never finish executing, so additional operations keep being queued up but never execute. This in turn leads to the app not being able to perform critical functions. I have not yet been able to identify any pattern other than that it occurs on one of my co-worker's devices every week or so. Running the app from Xcode at that point does not help, as killing and relaunching the app resolves the issue for the time being. I've tried attaching the debugger to the running process and I seem to be able to see log data, but any breakpoints I add are not registering. I've added a breadcrumb trail of NSLogs to try to pinpoint where it's hanging, but this has not yet led to a resolution.
I originally described the bug in another question, which has yet to receive a clear answer, I'm guessing because of the lack of info I'm able to provide about this issue.
A friend once told me that it's possible to save the entire memory state of an app at a given moment in some form and reload that exact state onto a process on a different device. Does anyone know how I can achieve that? If that's possible, the next time someone encounters the bug I can save that exact state of memory and replicate it to test all my theories of possible solutions. Or is there a different approach to tackling this? As an interim measure, do you think it would make sense to force the app to crash when it enters this state, so actual users would be less confused? I have mixed feelings about this, but the user will have to kill the app from the multitasking dock anyway in order to use it again. I can check the operation queue count or create some kind of timeout code for this until I actually nail this bug.
Comments (5)
This sounds like a deadlock caused by a very rare race condition. You also mentioned using a maxConcurrentOperationCount of 2. This means that either:
1 seems very unlikely as the queue should allow 2 concurrent operations to be completely blocked, unless you are using some system functions that have concurrency issues and block your queue instead of just one thread.
In this case my first attempt to debug would be to connect the debugger and pause execution. After that you can look at the stack traces for all threads. You should be able to find the 2 threads that are made by your operation queue, after which I would review the responsible functions to find code that might possibly wait on some lock. Make sure to take system functions into consideration.
Well, it's quite hard to solve bugs that don't crash the app but just hang a thread. If you can't find the bug by going through your code step by step and checking for possible deadlocks or race conditions, I would suggest implementing some logging.
Write your log to disk every time you add a log entry. That's not the most efficient way, but if you give a build with logging enabled to your co-worker you can pull the log from his iPhone when things go wrong, even while the app is still running.
Make sure you log every step you take, including the values of important variables around the code that you suspect of breaking the app. This way you can see what the app is doing and what state it is in.
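As a rough illustration of the "write to disk on every entry" idea, here is a minimal Swift sketch. The `DiskLogger` class and its method names are hypothetical, invented for this example; it assumes that flushing after each line is acceptable for a debug build, since that is slow but means the last entry before a hang is not lost in a buffer.

```swift
import Foundation

/// Hypothetical helper (not part of any framework): appends each log entry
/// to a file and flushes it to disk immediately.
final class DiskLogger {
    private let handle: FileHandle
    private let lock = NSLock()   // operations log from several threads

    init?(url: URL) {
        let fm = FileManager.default
        if !fm.fileExists(atPath: url.path) {
            guard fm.createFile(atPath: url.path, contents: nil) else { return nil }
        }
        guard let h = try? FileHandle(forWritingTo: url) else { return nil }
        _ = h.seekToEndOfFile()   // append instead of overwriting old entries
        handle = h
    }

    func log(_ message: String) {
        let thread = Thread.isMainThread ? "main" : "background"
        let line = "\(Date().timeIntervalSince1970) [\(thread)] \(message)\n"
        lock.lock()
        defer { lock.unlock() }
        handle.write(Data(line.utf8))
        handle.synchronizeFile()  // force the entry onto disk right away
    }

    deinit {
        handle.closeFile()
    }
}
```

In an app, the file would typically live in the Documents directory so it can be pulled from the device; a call such as `logger.log("sync operation: fetched 12 records")` at each step then leaves a trail that ends where the operation stopped making progress.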
Hope this helps a bit. I don't know about restoring the state of memory of an app, so I can't help you with that.
Note: if the app were crashing on the bug you could use some other tools, but if I understand correctly that's not the case here, is it?
I read the question describing the bug, and I would try to log to disk what the currently running operations are doing. It seems the operations hang once in a while, so the bug is somewhere in there. If you can log which methods are called while an operation is running, this will show you which function call hangs the app, and you can start looking there.
You didn't say this but I presume the bug occurs while a human operator is working with the app? Maybe you should add an automated mode to this app, where the app simulates the same operations that users normally do, using randomized times for starting different actions. Then you can leave the app running unattended on all your devices and increase the chances of seeing the problem.
Also, since the problem appears related to the NSOperationQueue, maybe you should subclass it so that you can add logging to the more interesting methods. For example, each time an operation is added you should log the state of the queue, since you suspect that sometimes it is getting suspended.
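A minimal Swift sketch of the subclassing idea follows. `LoggingOperationQueue` and `logHandler` are names made up for this example; it only overrides `addOperation(_:)`, so operations added through other entry points (for example the block-based variant) may bypass the log.

```swift
import Foundation

/// Hypothetical subclass: records the queue's state every time an
/// operation is added through addOperation(_:).
final class LoggingOperationQueue: OperationQueue {
    /// Where log lines go; defaults to the console. Swap in a file logger
    /// if the log needs to survive on a colleague's device.
    var logHandler: (String) -> Void = { print($0) }

    override func addOperation(_ op: Operation) {
        let label = op.name ?? String(describing: type(of: op))
        logHandler("adding \(label): pending=\(operationCount) "
            + "suspended=\(isSuspended) max=\(maxConcurrentOperationCount)")
        super.addOperation(op)
    }
}
```

If the pending count keeps growing between log lines while `suspended` stays false, the queue itself is not suspended and the problem is more likely an operation that never finishes.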
Also, as I suggested on your other question, you may want to set up an observer to get notified if the queue ever goes into a suspended state.
Good luck.
Checking assumptions here, since that never hurts: do you actually have evidence that your background threads are hanging? From what you report, the observed behavior is that the tasks you're putting in your background thread are not achieving the outcome that you expected. That doesn't necessarily indicate that the thread has hung. It might just indicate that, under the particular conditions, the thread closed because all tasks had completed, without the tasks achieving what you wanted them to.
Addition: Given your answer in the comments, it seems to me the next step is to log whenever an item begins executing in the queue, so that you can identify which items lead to the queue becoming blocked. My best guess is that it is a certain class of items or, if they are all of the same class, certain characteristics of the items. As the first step of executing each item, log enough that you'll have a reasonable characterization of the item. Then, once you get a real device that has entered this state, check the logs and see just what conditions are leading to this problem. That should enable you to reliably reproduce the problem on a device during debugging or in the simulator, and then nail it.
In other words, I would focus your attention on identifying the problematic operations first, rather than trying to identify the particular line of code where things are stalling.
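One way to identify the problematic operations is to record when each one starts and finishes, then periodically ask which ones have been running too long. The Swift sketch below is only an illustration under assumptions: `OperationTracker` and `trackedOperation` are hypothetical names, labels are assumed to be unique per operation, and the time limit is whatever counts as "stuck" in your app.

```swift
import Foundation

/// Hypothetical registry of operations that have started but not finished.
final class OperationTracker {
    private var started: [String: Date] = [:]
    private let lock = NSLock()

    func didStart(_ label: String) {
        lock.lock(); defer { lock.unlock() }
        started[label] = Date()
    }

    func didFinish(_ label: String) {
        lock.lock(); defer { lock.unlock() }
        started[label] = nil
    }

    /// Labels of operations that have been running for more than `limit` seconds.
    func stalled(olderThan limit: TimeInterval, now: Date = Date()) -> [String] {
        lock.lock(); defer { lock.unlock() }
        return started
            .filter { now.timeIntervalSince($0.value) > limit }
            .map { $0.key }
            .sorted()
    }
}

/// Wraps a unit of work so that its start and finish are recorded.
/// The label should describe the item well enough to recognise it in a log.
func trackedOperation(label: String,
                      tracker: OperationTracker,
                      work: @escaping () -> Void) -> Operation {
    let op = BlockOperation {
        tracker.didStart(label)
        defer { tracker.didFinish(label) }
        work()
    }
    op.name = label
    return op
}
```

A periodic check (a timer, or a call made whenever a new operation is enqueued) could log the result of `tracker.stalled(olderThan: 60)`; the same list could also drive the interim timeout measure mentioned in the question.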
In my case
instead of
had to be overridden.
When in doubt, consult https://developer.apple.com/documentation/foundation/nsoperation#1661262?language=objc and check your implementation for discrepancies.