I'm really sorry if this sounds kinda dumb. I just finished reading K&R and I worked through some of the exercises. This summer, for my project, I'm thinking of re-implementing a Linux utility to further expand my understanding of C, so I downloaded the source for GNU tar and sed, as they both seem interesting. However, I'm having trouble understanding where it all starts, where the main implementation is, where all the weird macros come from, etc.
I have a lot of time, so that's not really an issue. Am I supposed to familiarize myself with the GNU toolchain (i.e. make, binutils, ...) first in order to understand the programs? Or maybe I should start with something a bit smaller (if there is such a thing)?
I have a little bit of experience with Java, C++ and Python, if that matters.
Thanks!
The GNU programs are big and complicated. The size of GNU Hello World shows that even the simplest GNU project needs a lot of code and configuration around it.
The autotools are hard to understand for a beginner, but you don't need to understand them to read the code. Even if you modify the code, most of the time you can simply run make to compile your changes.
To read code, you need a good editor (Vim, Emacs) or an IDE (Eclipse) and some tools to navigate through the source. The tar project contains a src directory, which is a good place to start. A program always starts at the main function, so grep for it (something like grep -n "main" src/*.c will do), or use your IDE to search for that function. It is in tar.c. Now, skip all the initialization stuff until you reach the switch on the subcommand. There you see the branching structure for the commands: if you pass -x it does this, if you pass -c it does that, and so on. If you want to know where these subcommand macros come from, grep for one of them; you will see that they are listed in common.h.
Below EXTRACT_SUBCOMMAND you see something funny: instead of doing the extraction work directly, that case calls read_and() and hands it extract_archive. Grep again to find the definition of read_and(); it takes a single parameter.
That single parameter is a function pointer, like a callback, so read_and will presumably read something and then call the function extract_archive. Grep for extract_archive in turn and read its body; note that the real work happens when calling fun. fun is again a function pointer, which is set in prepare_to_extract. fun may point to extract_file, which does the actual writing.
I hope I have walked you a good way through this and shown you how I navigate through source code. Feel free to contact me if you have related questions.
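To make the shape of this clearer, here is a small, self-contained toy program that mimics the structure described above. It is not tar's actual code - the real read_and, extract_archive and fun carry far more state and parameters - but the switch-plus-callback pattern is the same idea:

    /* Toy illustration (NOT tar's real code): a switch on the subcommand
       hands a callback to a read_and()-style driver, which loops over the
       "archive" and calls the callback for each member. */
    #include <stdio.h>

    static void extract_archive (void) { puts ("pretend to extract one member"); }
    static void list_archive    (void) { puts ("pretend to list one member"); }

    /* Like tar's read_and(): do the reading, then call the callback. */
    static void read_and (void (*do_something) (void))
    {
      for (int i = 0; i < 3; i++)      /* pretend the archive has 3 members */
        do_something ();
    }

    int main (int argc, char **argv)
    {
      char subcommand = (argc > 1 && argv[1][0] == '-') ? argv[1][1] : 't';

      switch (subcommand)
        {
        case 'x': read_and (extract_archive); break;   /* like EXTRACT_SUBCOMMAND */
        case 't': read_and (list_archive);    break;
        default:  fprintf (stderr, "unknown subcommand\n"); return 1;
        }
      return 0;
    }

Compile it with gcc and run it with -x or -t to watch the two paths; in the real tar.c the switch is on the subcommand values listed in common.h.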
The problem with programs like tar and sed is twofold (this is just my opinion, of course!). First of all, they're both really old. That means they've had multiple people maintain them over the years, with different coding styles and different personalities. For GNU utilities, it's usually pretty good, because they usually enforce a reasonably consistent coding style, but it's still an issue. The other problem is that they're unbelievably portable. Usually "portability" is seen as a good thing, but when taken to extremes, it means your codebase ends up full of little hacks and tricks to work around obscure bugs and corner cases in particular pieces of hardware and systems. And for programs as widely ported as tar and sed, that means there are a lot of corner cases and obscure hardware/compilers/OSes to take into account.
If you want to learn C, then I would say the best place to start is not trying to study code that others have written. Rather, try to write code yourself. If you really want to start with an existing codebase, choose one that's being actively maintained, where you can see the changes other people are making as they make them, follow along in the discussions on the mailing lists, and so on.
With well-established programs like tar and sed, you see the result of the discussions that already happened, but you can't see how software design decisions and changes are made in real time. That can only happen with actively maintained software.
That's just my opinion of course, and you can take it with a grain of salt if you like :)
Why not download the source of the coreutils (http://ftp.gnu.org/gnu/coreutils/) and take a look at tools like yes? Less than 100 lines of C code, and a fully functional, useful and really basic piece of GNU software.
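To give a sense of how small such a tool is, here is a minimal sketch of the idea behind yes (my own toy version, not the coreutils source, which handles multiple arguments, buffering and output errors more carefully):

    /* Toy version of yes(1): print the first argument (or "y") forever. */
    #include <stdio.h>

    int main (int argc, char **argv)
    {
      const char *word = (argc > 1) ? argv[1] : "y";

      for (;;)
        if (puts (word) == EOF)   /* stop if stdout goes away */
          return 1;
    }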
GNU Hello is probably the smallest, simplest GNU program and is easy to understand.
I know sometimes it's a mess to navigate through C code, especially if you're not familiar with it. I suggest you use a tool that will help you browse through the functions, symbols, macros, etc. Then look for the main() function.
You need to familiarize yourself with the tools, of course, but you don't need to become an expert.
Learn how to use grep if you don't know it already, and use it to search for the main function and everything else that interests you. You might also want to use code browsing tools like ctags or cscope, which can integrate with vim and emacs, or use an IDE if you like that better.
I suggest using ctags or cscope for browsing. You can use them with vim/emacs. They are widely used in the open-source world.
They should be in the repositories of every major Linux distribution.
Making sense of some code which uses a lot of macros, utility functions, etc., can be hard. To better browse the code of a random piece of C or C++ software, I suggest this approach, which is what I generally use:
1. Install the Qt development tools and Qt Creator.
2. Download the sources you want to inspect, and set them up for compilation (usually just ./configure for GNU stuff).
3. Run qmake -project in the root of the source directory to generate a Qt .pro file for Qt Creator.
4. Open the .pro file in Qt Creator (do not use a shadow build when it asks).
5. Just to be safe, in the Qt Creator Projects view, remove the default build steps. The .pro file is just for navigation inside Qt Creator.
6. Optional: set up custom build and run steps if you want to build and run/debug under Qt Creator. Not needed for navigation only.
7. Use Qt Creator to browse the code. Note especially the locator (keyboard shortcut Ctrl+K) to find things by name, "follow symbol under cursor" (F2), and "find usages" (Ctrl+Shift+U).
I had to take a look at "sed" just to see what the problem was; it shouldn't be that big. I looked and I see what the issue is, and I feel like Charlton Heston catching first sight of a broken statue on the beach. All of what I'm about to describe for "sed" might also apply to "tar". But I haven't looked at it (yet).
A lot of GNU code got seriously grunged up - to the point of unmaintainable morbid legacy - for reasons I don't know. I don't know exactly when it happened, maybe late 1990's or early 2000's, but it was like someone flipped a switch and suddenly, all the nice modular mostly self-contained code widgets got massively grunged with all sorts of extraneous entanglements having little or no connection to what the application itself was trying to do.
In your case, "sed": an entire library got (needlessly) dragged in with the application. This was the case at least as early as version 4.2 (the last version predating your query), probably before that - I'd have to check.
Another thing that got grunged up was the build system (again) to the point of unmaintainability.
So, you're really talking about legacy rescue here.
My advice ... which is generic for any codebase that's been around a long time ... is to dig as deep as you can and go back to its earliest forms first; and to branch out and look at other "sed"'s - like those in the UNIX archive.
https://www.tuhs.org/Archive/
or in the BSD archive:
https://github.com/freebsd
https://github.com/weiss/original-bsd
(the second one goes deeper into early BSD in its earlier commits.)
Many of the "sed"'s on the GNU page - but not all of them - may be found under "Downloads" as a link "mirrors" on the GNU sed page:
https://www.gnu.org/software/sed/
Version 1.18 is still intact. Version 1.17 is implicitly intact, since there is a 1.17 to 1.18 diff present there. Neither version has all the extra stuff piled on top of it. It's more representative of what GNU software looked like before becoming knotted up with all the entanglements.
It's actually pretty small - only 8863 lines for the *.c and *.h files, in all. Start with that.
For me the process of analysis of any codebase is destructive of the original and always entails a massive amount of refactoring and re-engineering; and simplification coming from just writing it better and more natively, while yet keeping or increasing its functionality. Almost always, it is written by people who only have a few years' experience (by which I mean: less than 20 years, for instance) and have thus not acquired full-fledged native fluency in the language, nor the breadth of background to be able to program well.
For this, if you do the same, it's strongly advised that you have some sort of test suite already in place or added. There's one in the version 4.2 software, for instance, though it may be stress-testing new capabilities added between 1.18 and 4.2. Just be aware of that. (So, it might require reducing the test suite to fit 1.18.) Every change you make has to be validated by whatever tests you have in your suite.
You need to have native fluency in the language ... or else the willingness and ability to acquire it by carrying out the exercise and others like it. If you don't have enough years behind you, you're going to hit a soft wall. The deeper you go, the harder it might be to move forward. That's an indication that you're not experienced enough yet, and that you don't have enough breadth. So, this exercise then becomes part of your learning experience, and you'll just have to plod through.
Because of how early the first versions date from, you will have to do some rewriting anyhow, just to bring it up to standard. Later versions can be used as a guide for this process. At a bare minimum, it should be brought up to C99, as this is virtually mandated as part of POSIX. In other words, you should be at least as far up to date as the present century!
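As a small illustration of what "bringing it up to C99" means in practice (my own example, not taken from the sed sources): declarations can move to where they are used, // comments become legal, and <stdbool.h> gives you a real boolean type.

    #include <stdbool.h>
    #include <stdio.h>

    /* Old C89-style code declares everything at the top of the block and
       uses plain int for flags; C99 lets you write it like this instead. */
    static bool all_positive (const int *v, int n)
    {
      for (int i = 0; i < n; i++)   // C99: loop variable declared in the for
        if (v[i] <= 0)
          return false;
      return true;
    }

    int main (void)
    {
      int v[] = { 3, 1, 4 };
      printf ("all positive? %s\n", all_positive (v, 3) ? "yes" : "no");
      return 0;
    }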
Just the challenge of getting it to be functional will be exercise enough. You'll learn a lot of what's in it, just by doing that. The process of getting it to be operational is establishing a "baseline". Once you do that, you have your own version, and you can start with the "analysis".
Once a baseline is established, then you can proceed full throttle forward with refactoring and re-engineering. The test suite helps to provide cover against stumbles and inserted errors. You should keep all the versions that you have (re)made in a local repository so that you can jump back to earlier ones, in case you need to track down the sudden emergence of test failures or other bugs. Some bugs, you may find, were rooted all the way back in the beginning (thus: the discovery of hidden bugs).
After you have the baseline (re)written to your satisfaction, then you can proceed to layer in the subsequent versions. On GNU's archive, 1.18 jumps straight to 2.05. You'll have to make a "diff" between the two to see where all the changes were, and then graft them into your version of 1.18 to get your version of 2.05. This will help you better understand both the issues that the changes addressed and what changes were made.
At some point you're going to hit GNU's Grunge Wall. Version 2.05 jumped straight to 3.01 in GNU's historical archive. Some entanglements started slipping in with version 3.01. So, it's a soft wall we have here. But there's also an early test suite with 3.01, which you should use with 1.18, instead of 4.2's test suite.
When you hit the Grunge Wall, you'll see directly what the entanglements were, and you'll have to decide whether to go along for the ride or cast them aside. I can't tell you which direction is the rabbit hole, except that SED has been perfectly fine for a long time, most or all of it is what is listed in and mandated by the POSIX standard (even the current one), and what's there before version 3 serves that end.
I ran diffs. Between 2.05 and 3.01, the diff file is 5000 lines. Ok. That's (mostly) fine and is natural for code that's in development, but some of that may be coming from the soft Grunge Wall. Running a diff on 3.01 versus 4.2 yields a diff file that is over 60000 lines. You need only ask yourself: how can a program that's under 10000 lines - one that abides by an international standard (POSIX) - produce 60000 lines of differences? The answer is: that's what we call bloat. So, between 3.01 and 4.2, you're witnessing a problem that is very common to code bases: the rise of bloat.
So, that pretty much tells you which direction ("go along for the ride" versus "cast it aside") is the rabbit hole. I'd probably just stick with 3.01, and do a cursory review of the differences between 3.01 and 4.2 and of the change logs to get an overview of what the changes were, and just leave it at that, except maybe to find a different way to write in what they thought was necessary to change, if the reason for it was valid.
I've done legacy rescue before, before the term "legacy" was even in most people's vocabulary and am quick to recognize the hallmark signs of it. This is the kind of process one might go through.
We've seen it happen with some large codebases already. In effect, the superseding of X11 by Wayland was a massive exercise in legacy rescue. The ongoing superseding of GNU's gcc by clang may also be considered an instance of that.