Testing fault tolerant code
I’m currently working on a server application where we have agreed to try to maintain a certain level of service. The level of service we want to guarantee is: if a request is accepted by the server and the server sends an acknowledgement to the client, we want to guarantee that the request will happen, even if the server crashes. As requests can be long running and the acknowledgement time needs to be short, we implement this by persisting the request, then sending an acknowledgement to the client, then carrying out the various actions to fulfil the request. As actions are carried out they too are persisted, so the server knows the state of a request on start-up, and there are also various reconciliation mechanisms with external systems to check the accuracy of our logs.
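For concreteness, here is roughly the shape of that flow in Java; the names (RequestHandler, Request, Action, AckChannel) are invented for the example, not our real code:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Minimal sketch of the persist-then-acknowledge flow; all names are made up.
class RequestHandler {

    private final Path journal;

    RequestHandler(Path journal) {
        this.journal = journal;
    }

    void handle(Request request, AckChannel client) throws IOException {
        // 1. Persist the request durably *before* acknowledging it.
        append("ACCEPTED " + request.id());

        // 2. Acknowledge quickly; from here on the request must survive a crash.
        client.ack(request.id());

        // 3. Carry out the (possibly long-running) actions, journaling each one
        //    so a restarted server knows how far it got.
        for (Action action : request.actions()) {
            action.execute();
            append("DONE " + request.id() + " " + action.name());
        }
        append("COMPLETED " + request.id());
    }

    private void append(String line) throws IOException {
        try (FileChannel ch = FileChannel.open(journal,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            ch.write(StandardCharsets.UTF_8.encode(line + "\n"));
            ch.force(true); // fsync so the entry survives a crash
        }
    }
}

interface Request { String id(); Iterable<Action> actions(); }
interface Action { String name(); void execute(); }
interface AckChannel { void ack(String requestId); }
```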
This all seems to work fairly well, but we have difficulty saying this with any conviction, as we find it very difficult to test our fault tolerant code. So far we’ve come up with two strategies, but neither is entirely satisfactory:
- Have an external process watch the server process and then try to kill it off at what the external process thinks is an appropriate point in the test
- Add code to the application that will cause it to crash at certain known critical points
My problem with the first strategy is that the external process cannot know the exact state of the application, so we cannot be sure we’re hitting the most problematic points in the code. My problem with the second strategy, although it gives more control over where the fault takes place, is that I do not like having code to inject faults inside my application, even with optional compilation etc. I fear it would be too easy to overlook a fault injection point and have it slip into a production environment.
4 Answers
I think there are three ways to deal with this. First, if it is available to you, I would suggest a comprehensive set of integration tests for these various pieces of code, using dependency injection or factory objects to produce broken actions during those tests.
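For example, a factory wrapper along these lines (ActionFactory and Action are stand-ins for whatever abstraction the server already has) can hand back actions that fail at a chosen point:

```java
// Sketch of the "broken actions via a factory" idea; the interfaces are hypothetical.
interface Action { void execute(); }
interface ActionFactory { Action create(String name); }

// The production factory returns real actions; this test factory wraps them and
// fails at a configurable point, simulating a crash mid-request.
class FaultInjectingActionFactory implements ActionFactory {
    private final ActionFactory real;
    private final int failOnCall;   // which action invocation should fail
    private int calls = 0;

    FaultInjectingActionFactory(ActionFactory real, int failOnCall) {
        this.real = real;
        this.failOnCall = failOnCall;
    }

    @Override
    public Action create(String name) {
        Action delegate = real.create(name);
        return () -> {
            if (++calls == failOnCall) {
                // A test could instead call Runtime.getRuntime().halt(137)
                // to simulate an abrupt process death rather than an exception.
                throw new IllegalStateException("injected fault before action: " + name);
            }
            delegate.execute();
        };
    }
}
```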
Secondly, running the application with random kill -9s, and disabling network interfaces, may be a good way to test these things.
I would also suggest testing file system failure. How you would do that depends on your OS; on Solaris or FreeBSD I would create a zfs file system in a file and then rm the file while the application is running.
If you are using database code, then I would suggest testing failure of the database as well.
Another alternative to dependency injection, and probably the solution I would use, is interceptors. You can enable crash-test interceptors in your code; these would know the state of the application and introduce the failures listed above, or any others you may want to create, at the correct time. It would not require changes to your existing code, just some additional code to wrap it.
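If the components sit behind interfaces, a JDK dynamic proxy is one way to sketch such an interceptor without touching the existing classes (RequestStore and the method name are invented for the example):

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

// One way to realize the interceptor idea: wrap an existing component behind its
// interface and crash the process when a chosen method is entered.
final class CrashInterceptor implements InvocationHandler {
    private final Object target;
    private final String crashBeforeMethod;

    private CrashInterceptor(Object target, String crashBeforeMethod) {
        this.target = target;
        this.crashBeforeMethod = crashBeforeMethod;
    }

    @SuppressWarnings("unchecked")
    static <T> T wrap(T target, Class<T> iface, String crashBeforeMethod) {
        return (T) Proxy.newProxyInstance(
                iface.getClassLoader(),
                new Class<?>[] { iface },
                new CrashInterceptor(target, crashBeforeMethod));
    }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        if (method.getName().equals(crashBeforeMethod)) {
            // halt() skips shutdown hooks, which is closer to a real crash than System.exit().
            Runtime.getRuntime().halt(137);
        }
        return method.invoke(target, args);
    }
}

// Usage in a test (RequestStore is assumed to be an interface of yours):
//   RequestStore store = CrashInterceptor.wrap(realStore, RequestStore.class, "markDone");
```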
A possible answer to the first point is to multiply the experiments with your external process so that the probability of hitting the problematic parts of the code is increased. You can then analyse the core dump files to determine where the code actually crashed.
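As a rough sketch, the experiment loop could look like this (the two shell scripts are placeholders for however you start the server and run your reconciliation check):

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

// Repeatedly start the server, kill it hard after a random delay, then run
// whatever recovery/consistency check you already have.
public class CrashLoop {
    public static void main(String[] args) throws Exception {
        for (int run = 0; run < 1000; run++) {
            Process server = new ProcessBuilder("./start-server.sh").inheritIO().start();

            // Let it work for a random amount of time, then kill it abruptly
            // (on most platforms destroyForcibly() amounts to SIGKILL).
            Thread.sleep(ThreadLocalRandom.current().nextLong(100, 10_000));
            server.destroyForcibly();
            server.waitFor(30, TimeUnit.SECONDS);

            // Restart and run the reconciliation/consistency check; a non-zero
            // exit code means the guarantee was violated in this run.
            Process check = new ProcessBuilder("./check-consistency.sh").inheritIO().start();
            if (check.waitFor() != 0) {
                System.err.println("Inconsistency detected in run " + run);
                return;
            }
        }
    }
}
```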
Another way is to increase observability and/or commandability by stubbing library or kernel calls, i.e., without modifying your application code.
You can find some resources on the Fault Injection page of Wikipedia, in particular in the Software Implemented Fault Injection section.
Your concern about fault injection is not a fundamental one. You merely need a foolproof way to prevent such code from ending up in a deployment. One way to do so is to design your fault injector as a debugger, i.e. the faults are injected by a process external to your process. This already provides a level of isolation. Furthermore, most OSes provide some kind of access control which prevents debugging unless specifically enabled. In its most primitive form, that means limiting debugging to root; on other operating systems it requires a specific "debug privilege". Naturally, nobody will have that in production, and thus your fault injector cannot even run there.
Practically, the fault injector can set breakpoints at specific addresses, i.e. a particular function or even a line of code. You can then react to that, e.g. by terminating the process after a certain breakpoint has been hit three times.
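As a sketch of this debugger approach on the JVM, the Java Debug Interface can attach to a server started with a jdwp agent (e.g. -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005), plant a breakpoint, and hard-kill the target on the third hit; the class and method names here are invented:

```java
import com.sun.jdi.Bootstrap;
import com.sun.jdi.Location;
import com.sun.jdi.ReferenceType;
import com.sun.jdi.VirtualMachine;
import com.sun.jdi.connect.AttachingConnector;
import com.sun.jdi.connect.Connector;
import com.sun.jdi.event.BreakpointEvent;
import com.sun.jdi.event.Event;
import com.sun.jdi.event.EventSet;
import com.sun.jdi.request.BreakpointRequest;

import java.util.Map;

// Debugger-style fault injector: it lives entirely outside the server process
// and can only run where debug attach is permitted.
public class DebuggerFaultInjector {
    public static void main(String[] args) throws Exception {
        AttachingConnector connector = Bootstrap.virtualMachineManager()
                .attachingConnectors().stream()
                .filter(c -> c.name().equals("com.sun.jdi.SocketAttach"))
                .findFirst()
                .orElseThrow(() -> new IllegalStateException("no socket attach connector"));

        Map<String, Connector.Argument> arguments = connector.defaultArguments();
        arguments.get("hostname").setValue("localhost");
        arguments.get("port").setValue("5005");
        VirtualMachine vm = connector.attach(arguments);

        // Breakpoint at the entry of the interesting (hypothetical) method;
        // assumes the class has already been loaded in the target VM.
        ReferenceType type = vm.classesByName("com.example.RequestStore").get(0);
        Location entry = type.methodsByName("markDone").get(0).location();
        BreakpointRequest bp = vm.eventRequestManager().createBreakpointRequest(entry);
        bp.enable();

        int hits = 0;
        while (true) {
            EventSet events = vm.eventQueue().remove();
            for (Event event : events) {
                if (event instanceof BreakpointEvent && ++hits == 3) {
                    // Third hit: bring the target JVM down to simulate a crash.
                    vm.exit(137);
                    return;
                }
            }
            events.resume();
        }
    }
}
```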
I was just about to write the same as Justin :)
The component I would suggest replacing during testing is the logging component (if you have one; if not, I'd strongly suggest implementing one...). It's relatively easy to replace it with code that generates errors, and the logger usually gets enough information to know the current application state.
Also, it seems feasible to make sure that the testing code doesn't go into production. I would discourage conditional compilation, though, and rather go with some configuration file to select the logging component.
Using "random" kills might help to detect errors but is not well suited for systematic testing because of its non-determinism. Therefore I wouldn't use it for automatic tests.