Untangling assembly-language spaghetti code

Posted 2024-07-24 07:54:54

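I've inherited a 10K-line program written in 8051 assembly language that requires some changes. Unfortunately, it's written in the finest traditions of spaghetti code. The program, written as a single file, is a maze of CALL and LJMP statements (about 1200 in total), with subroutines that have multiple entry and/or exit points, if they can be identified as subroutines at all. All variables are global. There are comments; some are correct. There are no existing tests, and there is no budget for refactoring.

A little background on the application: the code controls a communications hub in a vending application that is currently deployed internationally. It handles two serial streams simultaneously (with the help of a separate communications processor) and can talk to up to four different physical devices, each from a different vendor. The manufacturer of one of the devices recently made a change ("Yeah, we made a change, but the software's absolutely the same!") that causes some system configurations to stop working, and is not interested in undoing it (whatever it was they didn't change).

The program was originally written by another company, transferred to my client, and then modified nine years ago by another consultant. Neither the original company nor the consultant is available as a resource.

Based on analysis of the traffic on one of the serial buses, I've come up with a hack that appears to work, but it's ugly and doesn't address the root cause. If I had a better understanding of the program, I believe I could address the actual problem. I have about one more week before the code is frozen to support an end-of-the-month ship date.

Original question: I need to understand the program well enough to make the changes without breaking anything. Has anyone developed techniques for working with such messes?

I see some great suggestions here, but I am limited by time. However, I may have another opportunity in the future to pursue some of the more involved courses of action.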


海夕 2024-07-31 07:54:55

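I'm afraid there is no magic bullet for this type of problem. The only solution I have found is to print out the ASM file, go somewhere quiet, and simulate running the program line by line in your head, writing down the contents of registers and memory locations on a notepad as you go. After a while you find it doesn't take as long as you expected. Be prepared to spend many hours doing this and to drink gallons of coffee. After a while you will understand what it is doing and can consider changes.

Does the 8051 have any unused I/O ports? If it does, and you can't work out when certain routines are being called, add code to drive those spare port pins high or low, then watch them with an oscilloscope while the program runs.

Good luck.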


半葬歌 2024-07-31 07:54:55

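I know this sounds crazy... but I am unemployed (I picked the wrong time to tell the majority partner to go to hell) and have some free time. I would be willing to take a look at it. I used to write assembly for the Apple ][ and the original PC. If I could play with your code on a simulator for a couple of hours, I could give you an idea of whether I have a chance of documenting it for you (without using up my unplanned vacation). Since I don't know anything about the 8051, this might not be possible for someone like me, but the simulator looked promising. I wouldn't want any money for it; the exposure to 8051 embedded development would be enough. I told you this would sound crazy.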


心凉 2024-07-31 07:54:55

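Find another job. Seriously! Failing that, the book "Working Effectively with Legacy Code" may help, although I think it defines legacy code as code without unit tests.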


何以畏孤独 2024-07-31 07:54:55

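I have done this a few times. Some recommendations:

- Start by reviewing the schematic; this should help you understand which ports and pins your desired changes affect.
- Use grep to find all the calls, branches, jumps, and returns. This can help you understand the flow and identify the blocks of code.
- Look at the reset vector and the interrupt table to identify the main lines.
- Use grep to create a cross-reference for all code labels and data references (if your assembler tools can't do this for you).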
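
If grep alone gets unwieldy, the cross-reference can be sketched in a few lines of Python. This is only a rough illustration: the function name `cross_reference`, the label syntax it accepts, and the list of mnemonics are my assumptions, and a real assembler dialect (macros, `EQU`, local labels) will need adjustments.

```python
import re
from collections import defaultdict

# 8051 mnemonics that transfer control to a label operand.
# "CALL" and "JMP" are the generic forms many assemblers accept.
BRANCHES = {"LCALL", "ACALL", "CALL", "LJMP", "AJMP", "SJMP", "JMP",
            "JZ", "JNZ", "JC", "JNC", "JB", "JNB", "JBC", "CJNE", "DJNZ"}

# Assumed label syntax: an identifier at the start of the line followed by ':'.
LABEL_RE = re.compile(r"^\s*([A-Za-z_?][\w?]*):")


def cross_reference(source):
    """Return (definitions, references) for one assembly source text.

    definitions maps label -> line number where it is defined.
    references maps label -> list of (line number, mnemonic) targeting it.
    """
    definitions = {}
    references = defaultdict(list)
    for lineno, raw in enumerate(source.splitlines(), start=1):
        line = raw.split(";", 1)[0]              # drop the comment part
        match = LABEL_RE.match(line)
        if match:
            definitions[match.group(1)] = lineno
            line = line[match.end():]
        parts = line.split(None, 1)
        if len(parts) < 2:
            continue                             # blank line, RET, NOP, ...
        mnemonic = parts[0].upper()
        if mnemonic not in BRANCHES:
            continue
        # The branch target is the last comma-separated operand.
        target = parts[1].split(",")[-1].strip()
        if target.startswith("@"):
            continue                             # e.g. JMP @A+DPTR has no label
        references[target].append((lineno, mnemonic))
    return definitions, dict(references)
```

Labels that show up in `references` from both call and jump mnemonics are good candidates for the "subroutines with multiple entry points" the question describes, and labels with no references at all may be dead code or interrupt entry points.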

Keep in mind Hofstadter's Law:
It always takes longer than you expect, even when you take into account Hofstadter's Law.

Good luck.


不…忘初心 2024-07-31 07:54:55

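How well do you understand the hardware platform this code runs on?

- Is it put into power-down mode (PCON = 2) to save power? If so, how is it woken up (reset or hardware interrupt)?
- Does it have to wait for the oscillator to stabilize after power-up before doing serial communication?
- Is it put into sleep (idle) mode (PCON = 1)?

Are there different versions of the hardware in the field?

Make sure you have all the different hardware versions to test on.

Don't waste time on a simulator; it is hard to use and you have to make a lot of assumptions about the hardware. Get yourself an in-circuit emulator (ICE) and run on the hardware.

The software was written in assembler for a reason, and you need to find out why, e.g.:
- memory constraints
- speed constraints

There may be a reason this code is a mess.

Have a look at the linker file for XDATA space, IDATA space, and CODE space. Is there no free code space, XDATA, or IDATA left?

The original author may have optimized the code to fit into the available memory.

If that is the case, you need to talk to the original developer to find out what he did.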


此生挚爱伱 2024-07-31 07:54:55

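You don't need a special budget for refactoring and testing. They save you money and let you work faster, so get to it. It's the technique you should use to make changes to legacy, inherited code, because it's the cheapest way to do so "without breaking things".

Most of the time I think there is a trade-off in which you get higher quality in exchange for spending more time, but with legacy code you're not familiar with, I think it's faster to write tests. You have to run the code before you ship it anyway, right?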


Smile简单爱 2024-07-31 07:54:55

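This is one of the few times I'd recommend putting your soft skills to work: present your PM/manager/CXO with your reasoning for a rewrite and the time and cost savings such an undertaking would bring.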


淡莣 2024-07-31 07:54:55

Cut it into pieces.


指尖上的星空 2024-07-31 07:54:55

I had a very similar problem with an 8052 program. The company had inherited a beast: the code ROM was full (64 KB), and about 1.5 MB of assembly spaghetti modules plus two 3000-line PL/M modules made up this coding monstrosity. The original developers of the software were long gone (this does not mean there was nobody, but indeed nobody who understood it as a whole), the compilers were from the mid-80s and ran on an MDS-70 emulator, and several critical modules were at the limits of these compilers. Add one more global symbol, and the linker would crash. Add one more symbol to an ASM file, and the compiler would crash.

So how could one start cutting this up?

First you will need tools. Notepad++, for example, is very handy because it can search across several files at once, which is ideal for finding which modules refer to a global symbol. This is probably the most crucial element.

If possible, get hold of any documents you can find on the software. The most immediate problem to solve with these beasts is to understand roughly how they are composed, that is, what their architecture is. This is usually not captured in the software itself, even if it is otherwise properly commented.

To recover the architecture yourself, first try to build a call graph. It is simpler to do than a data-flow graph, since there are usually fewer cross-file calls and jumps than global variables. For this call graph, consider only global symbols, assuming the source files are supposed to be modules (which is not necessarily true, but usually should be).

To do this, use your cross-file search tool and create a large list (for example in OpenOffice Calc) recording which symbol is defined in which file, and which files refer to (call) that symbol.

Then steal some large (!) sheets from the plotter and start sketching. If you are very proficient with some graphing software you may use it, but otherwise it is more likely to hold you back. Sketch a call graph showing which file calls into which other files (not showing the symbols themselves; with 50 or so files you wouldn't be able to manage it).

Most likely the result will be spaghetti. The goal is to straighten it out into a hierarchical tree, without loops, whose root is the file containing the program entry point. You may go through several sheets during this process, iteratively straightening the beast out. You may also find that certain files are so entangled that they cannot be represented without loops. In that case it is most likely that a single "module" somehow got split across two files, or that several conceptual modules were tangled together. Go back to your call list and group the symbols so as to cut the problematic files into smaller independent units (you will also need to check the files themselves for local jumps to see whether your assumed cut is possible).
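
If the call list already lives in a spreadsheet, the "which files cannot be drawn without loops" question can also be answered mechanically. The sketch below is an illustration rather than part of the paper-based method above: it assumes the list has been exported as a dictionary mapping each file name to the set of files it calls into, and the function name `entangled_groups` is made up. It uses Tarjan's strongly-connected-components algorithm to report groups of files that call each other in a cycle.

```python
def entangled_groups(calls):
    """Return groups of two or more files that call each other in a loop.

    calls maps a file name to the set of file names it calls into.
    Each returned group is a sorted list; the groups are sorted too.
    """
    index = {}                 # discovery order of each visited file
    low = {}                   # lowest index reachable from each file
    stack, on_stack = [], set()
    groups = []
    counter = 0

    def visit(node):
        nonlocal counter
        index[node] = low[node] = counter
        counter += 1
        stack.append(node)
        on_stack.add(node)
        for callee in calls.get(node, ()):
            if callee not in index:
                visit(callee)
                low[node] = min(low[node], low[callee])
            elif callee in on_stack:
                low[node] = min(low[node], index[callee])
        if low[node] == index[node]:
            # node is the root of one strongly connected component.
            group = []
            while True:
                member = stack.pop()
                on_stack.discard(member)
                group.append(member)
                if member == node:
                    break
            if len(group) > 1:
                groups.append(sorted(group))

    for node in sorted(calls):
        if node not in index:
            visit(node)
    return sorted(groups)
```

Every group it returns is a candidate for the regrouping step: either one conceptual module that was split across files, or several modules tangled together. An empty result means the file-level graph can already be drawn as a hierarchy.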

In the end, unless you are already working somewhere else for your own good, you will have a hierarchical call graph of conceptual modules. From this it is possible to deduce the software's intended architecture and work further.

The next goal is the architecture. Using the map you have made, navigate through the software and figure out its threads (interrupt and main program tasks) and the rough purpose of each module / source file. How you can do this and what you get depends more on the application domain.

When these two are done, the "rest" is rather straightforward. You should now essentially know what each part is supposed to do, and so know what you are likely dealing with when you start working on a source file. It is important, though, that whenever you find something "fishy" in a source file, where the program seems to do something irrelevant, you go back to your architecture and call graph and make corrections if necessary.

For the rest, the methods others mentioned apply well. I just outlined these to give some insight into what can be done in really hideous cases. I wish I had had just 10K lines of code to deal with back then...


中二柚 2024-07-31 07:54:55

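I'd say IanW's answer (just print it out and keep tracing) is probably the best. That said, I have a slightly off-the-wall idea:

Try running the code (probably the binary) through a disassembler that can reconstruct C code (if you can find one for the 8051). Maybe it will identify a few routines that you can't (easily) identify yourself.

Maybe it'll help.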


因为看清所以看轻 2024-07-31 07:54:54

First, I would try to get in touch with those people who originally developed the code or who at least maintained it before me, hopefully getting enough information to get a basic understanding of the code in general, so that you can start adding useful comments to it.

Maybe you can even get someone to describe the most important APIs (including their signature, return values and purpose) for the code. If global state is modified by a function, this should also be made explicit. Similarly, start to differentiate between functions and procedures, as well as input/output registers.

You should make it very clear to your employer that this information is required. If they don't believe you, have them actually sit down with you in front of this code while you describe what you are supposed to do and how you have to do it (reverse engineering). Having an employer with a background in computing and programming will actually be helpful in that case!

If your employer doesn't have such a technical background, ask him to bring in another programmer or colleague to explain your steps to him. Doing so will show him that you are serious and honest about it, because it's a real issue and not just from your point of view (make sure there are colleagues who know about this 'project').

If available and feasible, I would also make it very clear, that contracting (or at the very least contacting) former developers/maintainers (if they are no longer working for your company, that is) to help document this code would be a pre-requisite to realistically improve the code within a short time span and to ensure that it can be more easily maintained in the future.

Emphasize that this whole situation is due to shortcomings in the previous software development process and that these steps will help improve the code base. So, the code base in its current form is a growing problem and whatever is done now to handle this problem is an investment for the future.

This in itself is also important to help them assess and understand your situation: To do what you are supposed to do now is far from trivial, and they should know about it - if only to set their expectations straight (e.g. regarding deadlines and complexity of the task).

Also, personally I would start adding unit tests for those parts that I understand well enough, so that I can slowly start refactoring/rewriting some code.

In other words, good documentation and source code comments are one thing, but a comprehensive test suite is another important one: no one can realistically be expected to modify an unfamiliar code base without some established way of testing key functionality.

Given that the code is 10K lines, I would also look into factoring subroutines out into separate files with intuitive file names to make components more identifiable, preferably using access wrappers instead of global variables.

Besides, I would look into ways to further improve the readability of the source code by reducing its complexity. Subroutines with multiple entry points (and possibly even different parameter signatures?) look like a sure way to obfuscate the code unnecessarily.

Similarly, huge subroutines could be refactored into smaller ones to help improve readability.

So one of the very first things I'd look into would be to determine what makes the code base really hard to grok and then rework those parts, for example by splitting huge subroutines with multiple entry points into distinct subroutines that call each other instead.
If this cannot be done for performance reasons or because of call overhead, use macros instead.

In addition, if it is a viable option, I would consider incrementally rewriting portions of the code in a higher-level language, either a subset of C or at least by making extensive use of assembly macros, to help standardize the code base and to help localize potential bugs.

If an incremental rewrite in C is feasible, one possible way to get started would be to turn all obvious functions into C functions whose bodies are, at first, copied and pasted from the assembly file, so that you end up with C functions containing lots of inline assembly.

Personally, I would also try running the code in a simulator/emulator to easily step through it and hopefully start understanding the most important building blocks (while examining register and stack usage). A good 8051 simulator with a built-in debugger should be made available to you if you really have to do this largely on your own.

This would also help you come up with the initialization sequence and main loop structure as well as a callgraph.

Maybe you can even find a good open source 8051 simulator that can easily be modified to produce a full call graph automatically. A quick search turned up gsim51, but there are obviously several other options, including various proprietary ones.

If I were in your situation, I would even consider outsourcing the effort of modifying my tools to simplify working with this source code. For example, many SourceForge projects accept donations, and maybe you can talk your employer into sponsoring such a modification.

If not financially, maybe you could contribute the corresponding patches yourself?

If you are already using a proprietary product, you might even be able to talk with the manufacturer of this software and detail your requirements and ask them if they are willing to improve this product that way or if they can at least expose an interface to allow customers to make such customizations (some form of internal API or maybe even simple glue scripts).

If they are not responsive, indicate that your employer has been thinking of using a different product for some time now and that you were the only one insisting on that particular product to be used ... ;-)

If the software expects certain I/O hardware and peripherals, you may even want to look into writing a corresponding hardware simulation loop to run the software in an emulator.

Ultimately, I know for a fact that I would personally enjoy customizing other software to help me understand such a spaghetti code monster far more than manually stepping through the code and playing emulator myself, no matter how many gallons of coffee I can get.

Getting a usable call graph out of an open source 8051 emulator should not take much longer than, say, a weekend (at most), because it mostly means looking for CALL opcodes and recording their addresses (position and target), so that everything can be dumped to a file for later inspection.
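
To make the "look for CALL opcodes and record their addresses" step concrete, here is a minimal sketch. It relies only on the standard 8051 encodings (LCALL is opcode 0x12 followed by a 16-bit target, high byte first; ACALL has the low five opcode bits equal to 0x11 and an 11-bit target within the current 2 KB page). The `CallRecorder` class and its `on_fetch` hook are hypothetical: a real emulator such as gsim51 will have its own structure, and the hook would have to be wired into its instruction-fetch loop by hand.

```python
def call_target(code, pc):
    """Return the call target if the instruction at pc is LCALL or ACALL.

    code is the program memory image (bytes); pc must point at the first
    byte of an instruction, which an emulator's fetch loop guarantees.
    Returns None for any other instruction.
    """
    opcode = code[pc]
    if opcode == 0x12:
        # LCALL addr16: 3 bytes, target stored high byte first.
        return (code[pc + 1] << 8) | code[pc + 2]
    if opcode & 0x1F == 0x11:
        # ACALL addr11: 2 bytes. The top 5 address bits come from the
        # address of the following instruction, bits 10..8 from the opcode.
        page = (pc + 2) & 0xF800
        return page | ((opcode >> 5) << 8) | code[pc + 1]
    return None


class CallRecorder:
    """Hypothetical hook object: call on_fetch() once per executed instruction."""

    def __init__(self):
        self.edges = {}    # (call site, target) -> number of times executed

    def on_fetch(self, code, pc):
        target = call_target(code, pc)
        if target is not None:
            key = (pc, target)
            self.edges[key] = self.edges.get(key, 0) + 1
```

Because the recorder only sees calls that actually execute, the resulting graph reflects the test scenario that was run, not every path in the ROM; LJMP-style transfers into the middle of routines would need the same treatment to be captured.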

Having access to an emulator's internals would also be a great way to inspect the code further, for example to find recurring patterns of opcodes (say 20-50+) that could be factored out into standalone functions/procedures. This might help decrease the size and complexity of the code base even further.
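
A crude first pass at finding such recurring patterns does not even need the emulator. The sketch below slides a fixed-size window over the raw code image and reports byte sequences that occur more than once. It is an assumption-laden simplification: the function name `repeated_sequences` is made up, the windows are measured in bytes rather than decoded instructions, and overlapping or operand-only matches will produce false positives that have to be checked against a disassembly.

```python
from collections import defaultdict


def repeated_sequences(code, length, min_count=2):
    """Map each byte sequence of the given length to the offsets where it
    occurs, keeping only sequences seen at least min_count times."""
    seen = defaultdict(list)
    for offset in range(len(code) - length + 1):
        seen[code[offset:offset + length]].append(offset)
    return {seq: offsets for seq, offsets in seen.items()
            if len(offsets) >= min_count}
```

Starting with a long window and shrinking it until matches appear keeps the output manageable.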

The next step would probably be to examine stack and register usage, and to determine the type and size of the function parameters used, as well as their value ranges, so that you can devise corresponding unit tests.

Using tools like dot/graphviz to visualize the structure of the initialization sequence and the main loop itself will be a pure joy compared to doing all this manually.
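
Producing input for dot takes very little code. The following sketch assumes the call graph is available as (caller address, callee address) pairs, for example after mapping each recorded call site to the routine that contains it; the function name `to_dot` and the `sub_XXXX` naming for unlabeled addresses are illustrative choices, not anything a particular emulator provides.

```python
def to_dot(edges, names=None):
    """Render (caller, callee) address pairs as Graphviz DOT text.

    names optionally maps an address to a human-readable label;
    addresses without a label are shown as sub_XXXX.
    """
    names = names or {}

    def label(addr):
        return names.get(addr, f"sub_{addr:04X}")

    lines = ["digraph calls {"]
    for caller, callee in sorted(set(edges)):    # drop duplicate edges
        lines.append(f'    "{label(caller)}" -> "{label(callee)}";')
    lines.append("}")
    return "\n".join(lines)
```

Writing the returned text to a file such as `calls.dot` and running `dot -Tpng calls.dot -o calls.png` renders the graph.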

Also, you'll actually end up with useful data and documents that can serve as the foundation for better documentation in the long run.
