Determining the source language from a binary?

Posted on 2024-08-10 13:20:11

I responded to another question about developing for the iPhone in non-Objective-C languages, and I made the assertion that using, say, C# to write for the iPhone would strike an Apple reviewer wrong. I was speaking largely about UI elements differing between the ObjC and C# libraries in question, but a commenter made an interesting point, leading me to this question:

Is it possible to determine the language a program is written in, solely from its binary? If there are such methods, what are they?

Let's assume for the purposes of the question:

  • That from an interaction standpoint (console behavior, any GUI appearance, etc.) the two are identical.
  • That performance isn't a reliable indicator of language (no comparing, say, Java to C).
  • That you don't have an interpreter or something between you and the language - just raw executable binary.

Bonus points if your answer is as language-agnostic as possible.

Comments (8)

(り薆情海 2024-08-17 13:20:11

Short answer: YES

Long answer:

If you look at a binary, you can find the names of the libraries that have been linked in. Opening cmd.exe in TextPad easily finds the following at hex offset 0x270: msvcrt.dll, KERNEL32.dll, NTDLL.DLL, USER32.dll, etc. msvcrt provides the Microsoft C runtime support functions. KERNEL32, NTDLL, and USER32.dll are OS-specific libraries, which tell you either the target platform or the platform on which the program was built, depending on how well the cross-platform development environment segregates the two.
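
To make the library-name clue concrete, here is a minimal sketch in Python that scans raw bytes for anything that looks like a DLL file name. The helper name find_dll_names and the regular expression are assumptions made for illustration; a real tool would parse the PE import table rather than pattern-match, so treat this as a rough stand-in for eyeballing the file in an editor.

```python
import re

# Hypothetical helper, not part of any library: matches byte runs that
# look like "<name>.dll" (case-insensitive) anywhere in the file.
DLL_NAME = re.compile(rb"[A-Za-z0-9_.\-]+\.dll", re.IGNORECASE)


def find_dll_names(data: bytes) -> list[str]:
    """Return DLL-like names in order of first appearance, without repeats."""
    seen: list[str] = []
    seen_lower: set[str] = set()
    for match in DLL_NAME.finditer(data):
        name = match.group().decode("ascii")
        # DLL names are case-insensitive on Windows, so dedupe on lowercase.
        if name.lower() not in seen_lower:
            seen_lower.add(name.lower())
            seen.append(name)
    return seen
```

Assuming a copy of the executable is at hand, something like find_dll_names(open("cmd.exe", "rb").read()) would list the candidates; the path is only an example.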

Setting aside those clues, almost any C/C++ compiler has to insert the names of imported and exported functions into the binary; a list of these functions (or entry points) is stored in a table. C++ 'mangles' the function names to encode the arguments and their types in order to support overloaded methods. It is possible to obfuscate the function names, but they would still exist. The function signatures include the number and types of the arguments, which can be used to trace into the system or internal calls used in the program. At offset 0x4190 is "SetThreadUILanguage", which can be searched for to find out a lot about the development environment. I found the entry-point table at offset 0x1ED8A. I could easily see names like printf, exit, and scanf, along with __p__fmode, __p__commode, and __initenv.
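
As a rough illustration of how mangled names hint at the language, the sketch below classifies a symbol by its prefix. The function name guess_symbol_style is hypothetical. The two rules it relies on are that Itanium C++ ABI mangled names begin with "_Z" (with an extra leading underscore on some platforms) and that MSVC-decorated C++ names begin with "?". It is a heuristic only: other languages can reuse these schemes, and a plain name says nothing more than "not C++-mangled".

```python
def guess_symbol_style(name: str) -> str:
    """Guess the naming scheme of a symbol from its prefix (heuristic only)."""
    # Itanium C++ ABI: _Z...; some object formats prepend one more underscore.
    if name.startswith("_Z") or name.startswith("__Z"):
        return "C++ (Itanium ABI mangling)"
    # Microsoft Visual C++ decorated names start with a question mark.
    if name.startswith("?"):
        return "C++ (MSVC decoration)"
    # Anything else: plain C, extern "C", assembly, or another language.
    return "unmangled"


# Example symbols; the mangled ones encode "int foo(int)"-style signatures.
for symbol in ("_Z3fooi", "?foo@@YAHH@Z", "printf", "__p__fmode"):
    print(symbol, "->", guess_symbol_style(symbol))
```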

Any executable for the x86 processor will have a data segment containing any static text that was included in the program. Back in cmd.exe, at offset 0x42C8, is the text "S.o.f.t.w.a.r.e..P.o.l.i.c.i.e.s..M.i.c.r.o.s.o.f.t..W.i.n.d.o.w.s..S.y.s.t.e.m.". The string takes twice as many bytes as is normally necessary because it is stored using double-wide (two-byte) characters, probably for internationalization. Error codes or messages are a prime source here.
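
The double-wide strings described above can be pulled out mechanically. The following sketch assumes they are UTF-16LE text limited to printable ASCII, which is why each character shows up followed by a "." (a zero byte) in a text editor. The helper name utf16le_strings and the minimum length of 4 are assumptions for illustration.

```python
import re


def utf16le_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Find runs of printable ASCII characters stored as two-byte UTF-16LE."""
    # Each character is one printable byte followed by a zero byte.
    pattern = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len)
    return [m.group().decode("utf-16-le") for m in pattern.finditer(data)]
```

Strings containing characters outside the ASCII range would be missed by this pattern; widening it is possible but makes false positives more likely.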

At offset 0xB1B0 is "p.u.s.h.d" followed by mkdir, rmdir, chdir, md, rd, and cd; I left out the unprintable characters for readability. Those are all commands understood by cmd.exe.

For other programs, I've sometimes been able to find the path from which a program was compiled.

So, yes, it is possible to determine the source language from the binary.

放肆 2024-08-17 13:20:11

I'm not a compiler hacker (someday, I hope), but I figure that you may be able to find telltale signs in a binary file that would indicate what compiler generated it and some of the compiler options used, such as the level of optimization specified.

Strictly speaking, however, what you're asking is impossible. It could be that somebody sat down with a pen and paper and worked out the binary codes corresponding to the program that they wanted to write, and then typed that stuff out in a hex editor. Basically, they'd be programming in assembly without the assembler tool. Similarly, you may never be able to tell with certainty whether a native binary was written in straight assembler or in C with inline assembly.

As for virtual machine environments such as the JVM and .NET, I would expect that you can identify the VM from the bytecode in the binary executable. However, you may not be able to tell what the source language was, such as C# versus Visual Basic, unless there are particular compiler quirks that tip you off.
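
One cheap way to start identifying the environment is to look at the first few bytes of the file. The sketch below is only a first-pass container check under stated assumptions: the function name identify_container is hypothetical, and the table covers just a handful of well-known magic numbers. Note that the Java class-file magic 0xCAFEBABE is shared with Mach-O universal binaries, and that a .NET assembly is an ordinary PE file whose CLR header has to be inspected separately, so the result narrows things down rather than settling them.

```python
def identify_container(data: bytes) -> str:
    """Classify a file by its leading magic bytes (first-pass heuristic)."""
    if data.startswith(b"\xca\xfe\xba\xbe"):
        # Ambiguous: the same four bytes open both formats.
        return "Java class file or Mach-O universal binary"
    if data.startswith(b"\x7fELF"):
        return "ELF"
    if data.startswith(b"MZ"):
        # Native Windows code and .NET assemblies both use this container.
        return "PE/DOS executable"
    if data[:4] in (b"\xfe\xed\xfa\xce", b"\xfe\xed\xfa\xcf",
                    b"\xce\xfa\xed\xfe", b"\xcf\xfa\xed\xfe"):
        return "Mach-O"
    return "unknown"
```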

绮烟 2024-08-17 13:20:11

What about these tools:

PE Detective

PEiD

Both are PE identifiers. OK, they're both for Windows, but that's what the question was about when I landed here.

梦幻之岛 2024-08-17 13:20:11

I expect you could, if you disassemble the binary, or at least you may be able to identify the compiler, since not all compilers will emit the same code for printf, for example. So Objective-C and GNU C should differ here.

You have excluded all byte-code languages so this issue is going to be less common than expected.

苦笑流年记忆 2024-08-17 13:20:11

First, run the what command on some binaries and look at the output. CVS (and SVN) identifiers are scattered throughout the binary image, and most of those come from libraries.
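
For readers without the what command, here is a minimal Python sketch of the same idea. It assumes the traditional behaviour of what: look for the marker "@(#)" and print the text that follows up to a double quote, ">", newline, backslash, or NUL byte. The helper name what_strings is hypothetical, and revision keywords of the "$Id$" style are a different marker that this sketch does not look for.

```python
import re

# "@(#)" is the SCCS identification marker; capture what follows it up to
# one of the terminating characters: " > newline backslash NUL.
WHAT_MARKER = re.compile(rb'@\(#\)([^"\x00>\n\\]*)')


def what_strings(data: bytes) -> list[str]:
    """Return the identification strings that follow each "@(#)" marker."""
    return [m.group(1).decode("latin-1") for m in WHAT_MARKER.finditer(data)]
```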

Also, there's often a "map" to the various library functions. That's a big hint, also.

When the libraries are linked into the executable, there is often a map that's included in the binary file with names and offsets. It's part of creating "position independent code". You can't simply "hard-link" the various object files together. You need a map and you have to do some lookups when loading the binary into memory.

Finally, the start-up module for C, C++ (and, I imagine, C#) is unique to that compiler's default set of libraries.

浅唱々樱花落 2024-08-17 13:20:11

No, the bytecode is language-agnostic. Different compilers could even take the same source code and generate different binaries. That's why you don't see general-purpose decompilers that work on binaries.

猥︴琐丶欲为 2024-08-17 13:20:11

The command 'strings' could be used to get some hints as to what language was used (for instance, I just ran it on the stripped binary for a C application I wrote and the first entries it finds are the libraries linked by the executable).
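
If the strings command is not available, its basic behaviour is easy to approximate. The sketch below assumes the usual default of reporting runs of at least four printable ASCII characters; the helper name ascii_strings is made up for illustration, and the real tool has more options (other encodings, section selection) than this covers.

```python
import re


def ascii_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of at least min_len printable ASCII characters."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]
```

On a stripped C binary, library names such as libc.so.6 would typically show up near the top of the output, which matches the observation above.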

亚希 2024-08-17 13:20:11

Well, C is initially converted to ASM, so you could write all C code in ASM.
