Determining the source language from a binary?

Posted 2024-08-10 13:20:11

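I answered another question about developing for the iPhone in languages other than Objective-C, and I asserted that an app written in, say, C# would strike an Apple reviewer as wrong. I was mostly talking about how UI elements differ between the ObjC and C# libraries in question, but a commenter made an interesting point that led me to this question:

Is it possible to determine the language a program was written in based solely on its binary? If such methods exist, what are they?

For the purposes of the question, let's assume:

That from an interaction standpoint (console behavior, any GUI appearance, etc.) the two are identical.
That performance is not a reliable indicator of language (no comparing, say, Java to C).
That there is no interpreter or anything else between you and the language, just the raw executable binary.

Bonus points if your answer is as language-agnostic as possible.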


(り薆情海 2024-08-17 13:20:11

Short answer: YES

Long answer:

If you look at a binary, you can find the names of the libraries that have been linked in. Opening cmd.exe in TextPad, I easily found the following at hex offset 0x270: msvcrt.dll, KERNEL32.dll, NTDLL.DLL, USER32.dll, etc. msvcrt.dll holds the Microsoft C runtime support functions. KERNEL32.dll, NTDLL.DLL, and USER32.dll are OS-specific libraries, which tell you either the target platform or the platform on which the program was built, depending on how well the cross-platform development environment segregates the two.
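The same search can be done without a text editor. Below is a minimal sketch in Python, assuming that scanning the raw bytes for ASCII runs ending in ".dll" is good enough; it is a heuristic and does not parse the PE import table. The function name `find_dll_names` and the sample bytes are made up for illustration.

```python
import re

# ASCII run ending in ".dll" (any case). Heuristic only: this does not
# walk the PE import table, it just looks for DLL-like names in the bytes.
DLL_NAME = re.compile(rb"[A-Za-z0-9_\-.]{1,64}\.dll", re.IGNORECASE)


def find_dll_names(data: bytes) -> list[str]:
    """Return DLL-like names in order of first appearance, without repeats."""
    seen: dict[str, str] = {}
    for match in DLL_NAME.finditer(data):
        name = match.group().decode("ascii")
        # Compare case-insensitively, but keep the spelling found first.
        seen.setdefault(name.lower(), name)
    return list(seen.values())


if __name__ == "__main__":
    # Made-up bytes standing in for part of an executable.
    sample = b"\x00\x01msvcrt.dll\x00\x00KERNEL32.dll\x00NTDLL.DLL\x00USER32.dll\x00"
    print(find_dll_names(sample))
```

For a real file, something like `find_dll_names(open(path, "rb").read())` would do. Seeing msvcrt.dll in the result hints at a C or C++ toolchain, though other languages can link against the C runtime too.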

Setting aside those clues, a C or C++ binary that links dynamically has to carry the names of the functions it imports and exports, stored in a table, so that the loader can resolve them. C++ 'mangles' function names to encode the arguments and their types, which is how it supports overloaded methods. Names can be obfuscated, but the ones the loader needs still have to be present. A mangled name therefore reveals the number and types of the arguments, which helps in tracing the system or internal calls a program makes. At offset 0x4190 is "SetThreadUILanguage", which can be searched for to find out a lot about the development environment. I found the entry-point table at offset 0x1ED8A. I could easily see names like printf, exit, and scanf, along with __p__fmode, __p__commode, and __initenv.
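Mangled names are a language hint in their own right. Here is a rough Python sketch, assuming only two widely used conventions: names mangled under the Itanium C++ ABI (GCC, Clang) start with `_Z`, or `__Z` where the platform adds a leading underscore, and MSVC-decorated C++ names start with `?`. The helper name `mangling_style` and its labels are invented for this example, and other languages that reuse these schemes would be misclassified.

```python
def mangling_style(symbol: str) -> str:
    """Guess which name-mangling convention a symbol follows (heuristic)."""
    if symbol.startswith(("_Z", "__Z")):
        return "itanium-c++"   # e.g. _Z3fooi is foo(int)
    if symbol.startswith("?"):
        return "msvc-c++"      # e.g. ?foo@@YAHH@Z is int __cdecl foo(int)
    return "unmangled"         # plain C-style name such as printf


if __name__ == "__main__":
    for name in ["printf", "__p__fmode", "_Z3fooi", "?foo@@YAHH@Z"]:
        print(name, "->", mangling_style(name))
```

A table full of unmangled names such as printf and exit, as in cmd.exe, points toward C rather than C++.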

An x86 executable typically has a data segment that holds any static text included in the program. Back in cmd.exe, at offset 0x42C8, is the text "S.o.f.t.w.a.r.e..P.o.l.i.c.i.e.s..M.i.c.r.o.s.o.f.t..W.i.n.d.o.w.s..S.y.s.t.e.m.". The string takes twice as many bytes as it otherwise would because it is stored in two-byte wide characters, probably for internationalization; the dots mark the extra byte of each character. Error codes or messages are a prime source here.
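Wide strings like this one are easy to miss with a plain ASCII search. The sketch below, in Python, assumes the text is UTF-16LE and limited to printable ASCII characters, so each character appears as one printable byte followed by a zero byte. The function name `utf16le_strings` and the `min_chars` threshold are choices made for this example.

```python
import re


def utf16le_strings(data: bytes, min_chars: int = 4) -> list[str]:
    """Return runs of at least min_chars printable ASCII chars stored as UTF-16LE."""
    # One printable ASCII byte followed by a zero byte, repeated.
    pattern = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_chars)
    return [m.group().decode("utf-16-le") for m in pattern.finditer(data)]


if __name__ == "__main__":
    # Made-up bytes: two wide strings separated by non-text bytes.
    sample = (
        b"\x01\x02"
        + "Software".encode("utf-16-le")
        + b"\x00\x00\x07"
        + "pushd".encode("utf-16-le")
        + b"\x00\x00"
    )
    print(utf16le_strings(sample))
```

Raising `min_chars` cuts down on accidental matches in code or binary data, at the cost of missing short strings.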

At offset 0xB1B0 is "p.u.s.h.d", followed by mkdir, rmdir, chdir, md, rd, and cd; I left out the unprintable characters for readability. Those are all commands that cmd.exe accepts.

For other programs, I've sometimes been able to find the path from which a program was compiled.
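One place such paths tend to show up, as an assumption about the toolchain rather than something stated above, is the debug information of binaries built with Microsoft tools, which often records the full path of the .pdb file. A minimal Python sketch that looks for drive-letter paths ending in ".pdb" follows; the function name `find_pdb_paths` and the sample path are hypothetical.

```python
import re

# Drive letter, colon, backslash, then printable ASCII up to ".pdb".
# Heuristic only: assumes an ASCII Windows-style path stored in the file.
PDB_PATH = re.compile(rb"[A-Za-z]:\\[\x20-\x7e]{1,255}?\.pdb")


def find_pdb_paths(data: bytes) -> list[str]:
    """Return Windows-style .pdb paths found in the raw bytes."""
    return [m.group().decode("ascii") for m in PDB_PATH.finditer(data)]


if __name__ == "__main__":
    # Made-up bytes; the path is purely illustrative.
    sample = b"\x00RSDS\x01\x02C:\\build\\app\\Release\\app.pdb\x00"
    print(find_pdb_paths(sample))
```

Directory names in such a path (a compiler version, a project layout, a build-system folder) can hint at the toolchain and therefore at the language.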

So, yes, it is possible to determine the source language from the binary.
