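Thoughts on the philosophy of minimizing code and maximizing data
I have heard of the concept of minimizing code and maximizing data, and was wondering what advice other people could give me on how and why I should do this when building my own systems?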
Typically, data-driven code is easier to read and maintain. I know I've seen cases where the data-driven approach was taken to an extreme and wound up nearly unusable (I'm thinking of some SAP deployments I've used), but coding your own "Domain Specific Languages" to help you build your software is typically a huge time saver.
The pragmatic programmers remain in my mind the most vivid advocates of writing little languages that I have read. Little state machines that run little input languages can get a lot accomplished with very little space, and make it easy to make modifications.
A specific example: consider a progressive income tax system, with tax brackets at $1,000, $10,000, and $100,000 USD. Income below $1,000 is untaxed. Income between $1,000 and $9,999 is taxed at 10%. Income between $10,000 and $99,999 is taxed at 20%. And income of $100,000 and above is taxed at 30%. If you were to write this all out in code, it'd look about as you suspect:
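A minimal sketch of such hard-coded logic, in Python. The function name `tax_hardcoded` is hypothetical, and the sketch assumes each rate applies marginally, that is, only to the part of the income that falls inside its bracket:

```python
def tax_hardcoded(income: int) -> float:
    """Marginal tax with every bracket baked into the code (hypothetical helper)."""
    if income < 1_000:
        return 0.0
    if income < 10_000:
        # 10% on the part above $1,000
        return (income - 1_000) * 10 / 100
    if income < 100_000:
        # full 10% bracket, then 20% on the part above $10,000
        return 9_000 * 10 / 100 + (income - 10_000) * 20 / 100
    # full 10% and 20% brackets, then 30% on the part above $100,000
    return 9_000 * 10 / 100 + 90_000 * 20 / 100 + (income - 100_000) * 30 / 100
```

Every threshold and rate appears as a literal, some of them more than once, which is what makes this version tedious to change.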
Adding new tax brackets, changing the existing brackets, or changing the tax burden in the brackets, would all require modifying the code and recompiling.
But if it were data-driven, you could store this table in a configuration file:
Write a little tool to parse this table and do the lookups (not very difficult, right?) and now anyone can easily maintain the tax rate tables. If Congress decides that 1000 brackets would be better, anyone could make the tables line up with the IRS tables and be done with it, with no recompiling necessary. The same generic code could be used for one bracket or hundreds of brackets.
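As a rough illustration of that little tool, here is a Python sketch. The file format (one "lower bound, rate in percent" pair per line, with `#` comments) and the names `TAX_TABLE`, `parse_brackets` and `tax` are assumptions made up for this example; the table is shown inline as a string, where a real tool would read it from the configuration file. Rates are again assumed to apply marginally.

```python
# Hypothetical config format: "<lower bound in USD> <rate in percent>" per line.
TAX_TABLE = """\
# lower bound   rate (%)
1000            10
10000           20
100000          30
"""

def parse_brackets(text):
    """Parse the table into a sorted list of (lower_bound, rate_percent) pairs."""
    brackets = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        lower, rate = line.split()
        brackets.append((int(lower), int(rate)))
    return sorted(brackets)

def tax(income, brackets):
    """Marginal tax: each rate applies only to the income inside its bracket."""
    total = 0.0
    for i, (lower, rate) in enumerate(brackets):
        if income <= lower:
            break
        # The bracket ends where the next one starts, or at the income itself.
        upper = brackets[i + 1][0] if i + 1 < len(brackets) else income
        total += (min(income, upper) - lower) * rate / 100
    return total

brackets = parse_brackets(TAX_TABLE)
print(tax(50_000, brackets))
```

The loop never mentions a specific threshold, so adding or changing brackets only touches the table.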
And now for something that is a little less obvious: testing. The AppArmor project has hundreds of tests for what system calls should do when various profiles are loaded. One sample test looks like this:
It relies on some helper functions to generate and load profiles, test the results of the functions, and report back to users. It is far easier to extend these little test scripts than it is to write this sort of functionality without a little language. Yes, these are shell scripts, but they are so far removed from actual shell scripts ;) that they are practically data.
I hope this helps motivate data-driven programming; I'm afraid I'm not as eloquent as others who have written about it, and I certainly haven't gotten good at it, but I try.
In modern software the line between code and data can become awfully thin and blurry, and it is not always easy to tell the two apart. After all, as far as the computer is concerned, everything is data, unless it is determined by existing code - normally the OS - to be otherwise. Even programs have to be loaded into memory as data, before the CPU can execute them.
For example, imagine an algorithm that computes the cost of an order, where larger orders get lower prices per item. It is part of a larger software system in a store, written in C.
This algorithm is written in C and reads a file that contains an input table provided by the management with the various per-item prices and the corresponding order size thresholds. Most people would argue that a file with a simple input table is, of course, data.
Now, imagine that the store changes its policy to some sort of asymptotic function, rather than pre-selected thresholds, so that it can accommodate insanely large orders. They might also want to factor in exchange rates and inflation - or whatever else the management people come up with.
The store hires a competent programmer and she embeds a nice mathematical expression parser in the original C code. The input file now contains an expression with global variables, functions such as `log()` and `tan()`, as well as some simple stuff like the Planck constant and the rate of carbon-14 degradation. Most people would still argue that the expression, even if not as simple as a table, is in fact data. After all it is probably provided as-is by the management.
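To make the "expression as input" stage concrete, here is a small sketch of a restricted expression evaluator. It is written in Python on top of the standard `ast` module rather than embedded in C as in the story, and the names `evaluate`, `BINARY_OPS` and `FUNCTIONS`, as well as the sample pricing formula, are hypothetical.

```python
import ast
import math
import operator

# Only these operators and functions are allowed in an expression.
BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}
FUNCTIONS = {"log": math.log, "tan": math.tan}

def evaluate(expression, variables):
    """Evaluate an arithmetic expression using only whitelisted syntax."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.Name) and node.id in variables:
            return variables[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in BINARY_OPS:
            return BINARY_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in FUNCTIONS and not node.keywords):
            return FUNCTIONS[node.func.id](*(walk(arg) for arg in node.args))
        raise ValueError(f"unsupported syntax: {ast.dump(node)}")

    return walk(ast.parse(expression, mode="eval").body)

# A made-up pricing formula, as management might supply it in the input file.
formula = "base_price * quantity / (1 + log(quantity))"
print(evaluate(formula, {"base_price": 10.0, "quantity": 1}))
```

The input is still parsed and interpreted, but it cannot loop, define functions or reach the rest of the system, which is part of why it still feels like data.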
The store receives a large number of complaints from clients who became brain-dead trying to estimate their expenses, and from the accounting people about the large amount of loose change. The store decides to go back to the table for small orders and use a Fibonacci sequence for larger orders.
The programmer gets tired of modifying and recompiling the C code, so she embeds a Python interpreter instead. The input file now contains a Python function that polls a roomful of `Fib(n)` monkeys for the cost of large orders.
Question: Is this input file data?
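For concreteness, the last stage might look roughly like the sketch below. The host is plain Python here instead of a C program with an embedded interpreter, and the file contents, `load_pricing` and `order_cost` are all hypothetical names invented for the example.

```python
# Contents of a hypothetical input file, shown inline as a string.
PRICING_SCRIPT = '''
def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def order_cost(quantity):
    return fib(quantity)
'''

def load_pricing(source):
    """Execute the input file and return the pricing function it defines."""
    namespace = {}
    exec(source, namespace)  # runs arbitrary code: acceptable only for trusted input
    return namespace["order_cost"]

order_cost = load_pricing(PRICING_SCRIPT)
print(order_cost(10))
```

Unlike a table or a restricted expression, this input can do anything the host program can do.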
From a strictly technical point of view, nothing is different. Both the table and the expression needed to be parsed before use. The mathematical expression parser probably supported branching and functions - it might not have been Turing-complete, but it still used a language of its own (e.g. MathML).
Yet now many people would argue that the input file just became code.
So what is the distinguishing feature that turns the input format from data into code?
Modifiability: Having to recompile the whole system to effect a change is a very good indication of a code-centric system. Yet I can easily imagine (well, more like I have actually seen) software that has been designed incompetently enough to have e.g. an input table built in at compile time. And let's not forget that many applications still have icons - which most people would deem data - built into their executables.
Input format: This is the most common factor that people consider, naively in my opinion: "If it is in a programming language then it is code". Fine, C is code - you have to compile it after all. I would also agree that Python is code - it is a full-blown language. So why isn't XML/XSL code? XSL is quite a complex language in its own right - hence the L in its name.
In my opinion, neither of these two criteria is the actual distinguishing feature. I think that people should consider something else:
This, of course, means that whether a system is data-driven or not should be considered at least in relation to the target audience - if not in relation to the client on a case-by-case basis.
It also means that the distinction can be affected by the available toolset. The UML specification is a nightmare to go through, but these days we have all those graphical UML editors to help us. If there were some kind of third-party high-level AI tool that parsed natural language and produced XML/Python/whatever, then the system would become data-driven even for far more complex input.
A small store probably does not have the expertise or the resources to hire a third party. So, something that allows the workers to modify its behaviour with the knowledge that one would get in an average management course - mathematics, charts etc - could be considered sufficiently data-driven for this audience.
On the other hand, a multi-billion-dollar international corporation usually has a bunch of IT specialists and Web designers on its payroll. Therefore, XML/XSL, Javascript, or even Python and PHP are probably easy enough for it to handle. It also has complex enough requirements that something simpler might just not cut it.
I believe that when designing a software system, one should strive to achieve that fine balance in the used input formats where the target audience can do what they need to, without having to frequently call on third parties.
It should be noted that outsourcing blurs the lines even more. There are quite a few problems for which current technology simply does not allow a solution that is approachable by the layman. In that case the target audience of the solution should probably be considered to be the third party to which the operation would be outsourced.
That third party can be expected to employ a fair number of experts.
One of five maxims under the Unix Philosophy, as presented by Rob Pike, is this:
Data dominates. If you have chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.
It is often shortened to, "write stupid code that uses smart data."
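A small, generic illustration of "stupid code, smart data" (not taken from Pike): converting integers to Roman numerals. All the knowledge lives in the table, and the loop knows nothing about Roman numerals. The names `ROMAN` and `to_roman` are made up for this sketch.

```python
# The "smart data": every value/symbol pair, including the subtractive forms.
ROMAN = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
    (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
    (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
]

def to_roman(n):
    """The "stupid code": greedily consume the table from largest to smallest."""
    parts = []
    for value, symbol in ROMAN:
        count, n = divmod(n, value)
        parts.append(symbol * count)
    return "".join(parts)

print(to_roman(1994))  # MCMXCIV
```

Without the subtractive entries (900, 400, 90, ...) in the table, the same conversion would need special-case branches in the code.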
Other answers have already dug into how you can often code complex behavior with simple code that just reacts to the pattern of its particular input. You can think of the data as a domain-specific language, and of your code as an interpreter (maybe a trivial one).
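One way to picture "data as a language, code as a trivial interpreter" is a table-driven state machine. The turnstile below is a standard textbook example, not something from the other answers; `TRANSITIONS` and `run` are hypothetical names.

```python
# The "program": a transition table mapping (state, event) to the next state.
TRANSITIONS = {
    ("locked", "coin"): "unlocked",
    ("locked", "push"): "locked",
    ("unlocked", "coin"): "unlocked",
    ("unlocked", "push"): "locked",
}

def run(transitions, start, events):
    """The "interpreter": feed events through the table, return the final state."""
    state = start
    for event in events:
        state = transitions[(state, event)]
    return state

print(run(TRANSITIONS, "locked", ["coin", "push", "push"]))  # locked
```

Changing the machine's behavior means editing the table; `run` stays the same.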
Given lots of data you can go further: the statistics can power decisions. Peter Norvig wrote a great chapter illustrating this theme in Beautiful Data, with text, code, and data all available online. (Disclosure: I'm thanked in the acknowledgements.) On pp. 238-239:
He shows this concretely with code in Python using a dataset collected at Google. Besides spelling correction, there's code to segment words and to decipher cryptograms -- in just a couple pages, again, where Grady Booch's book spent dozens without even finishing it.
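The flavor of that approach can be shown with a heavily simplified sketch: word counts from a corpus decide which nearby word is the most plausible correction. This is not Norvig's code, the one-sentence corpus is a toy stand-in for a real dataset, and the names `CORPUS`, `COUNTS`, `edits1` and `correct` are chosen for this example.

```python
import re
from collections import Counter

# Toy corpus; a real system would count words over a very large body of text.
CORPUS = "the quick brown fox jumps over the lazy dog the dog barks"
COUNTS = Counter(re.findall(r"[a-z]+", CORPUS.lower()))
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one delete, transpose, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Return the most frequent known word within one edit, else the word itself."""
    if word in COUNTS:
        return word
    candidates = [w for w in edits1(word) if w in COUNTS]
    return max(candidates, key=COUNTS.get) if candidates else word

print(correct("teh"), correct("dag"))
```

The code contains no spelling rules at all; its quality depends almost entirely on the data it counts.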
"The Unreasonable Effectiveness of Data" develops the same theme more broadly, without all the nuts and bolts.
I've taken this approach in my work for another search company and I think it's still underexploited compared to table-driven/DSL programming, because most of us weren't swimming in data so much until the last decade or two.
In languages in which code can be treated as data it is a non-issue. You use what's clear, brief, and maintainable, leaning towards data, code, functional, OO, or procedural, as the solution requires.
In procedural code the distinction is marked, and we tend to think of data as something stored in a specific way, but even in procedural code it is best to hide the data behind an API, or behind an object in OO.
A `lookup(avalue)` can be reimplemented in many different ways during its lifetime, as long as it starts as a function.
...All the time I design programs for nonexisting machines and add: 'if we now had a machine comprising the primitives here assumed, then the job is done.'
... In actual practice, of course, this ideal machine will turn out not to exist, so our next task --structurally similar to the original one-- is to program the simulation of the "upper" machine... But this bunch of programs is written for a machine that in all probability will not exist, so our next job will be to simulate it in terms of programs for a next lower level machine, etc., until finally we have a program that can be executed by our hardware...
E. W. Dijkstra in Notes on Structured Programming, 1969, as quoted by John Allen, in Anatomy of Lisp, 1978.
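Returning to the `lookup(avalue)` point above, here is a sketch of one function interface backed by two different representations. The sample table and the names `_TABLE`, `_KEYS`, `_VALUES` and `lookup_sorted` are hypothetical.

```python
import bisect

# First implementation: a plain dictionary hidden behind the function.
_TABLE = {"apple": 3, "banana": 5, "cherry": 7}

def lookup(avalue):
    return _TABLE.get(avalue)

# Later implementation: sorted parallel lists searched with bisect.
# Callers that only ever used the function interface would not notice the swap.
_KEYS = sorted(_TABLE)
_VALUES = [_TABLE[key] for key in _KEYS]

def lookup_sorted(avalue):
    i = bisect.bisect_left(_KEYS, avalue)
    if i < len(_KEYS) and _KEYS[i] == avalue:
        return _VALUES[i]
    return None

print(lookup("banana"), lookup_sorted("banana"))
```

Because the data sits behind a function, it could just as well move to a file or a database later without touching the call sites.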
When I think of this philosophy which I agree with quite a bit, the first thing that comes to mind is code efficiency.
When I'm making code I know for sure it isn't always anything close to perfect or even fully knowledgeable. Knowing enough to get close to maximum efficiency out of a machine when it is needed and good efficiency the rest of the time (perhaps trading off for better workflow) has allowed me to produce high quality finished products.
Coding in a data-driven way, you end up using code for what code is for. To go and 'outsource' every variable to files would be foolishly extreme: the functionality of a program needs to be in the program, while the content, settings and other factors can be managed by the program.
This also allows for much more dynamic applications and new features.
If you have even a simple form of database, you are able to apply the same functionality to many states. You may also do all manner of creative things like changing the context of what your program is doing based on file header data or perhaps directory, file name or extension, though not all data is necessarily stored on a filesystem.
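As a sketch of switching context on the file extension, the handler table below picks a parser from the file name. The handlers, the `HANDLERS` table and the `handle` function are invented for this example; only the standard `json`, `csv` and `pathlib` modules are used.

```python
import csv
import io
import json
from pathlib import Path

def parse_json(text):
    return json.loads(text)

def parse_csv(text):
    return list(csv.reader(io.StringIO(text)))

def parse_text(text):
    return text.splitlines()

# Data decides which code runs: supporting a new format is one new table entry.
HANDLERS = {".json": parse_json, ".csv": parse_csv}

def handle(filename, text):
    """Pick a parser from the file extension, falling back to plain text."""
    handler = HANDLERS.get(Path(filename).suffix.lower(), parse_text)
    return handler(text)

print(handle("order.json", '{"items": 3}'))
```

The same idea works with header bytes, directories or database rows as the key instead of the extension.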
Finally keeping your code in a state where it is simply handling data puts you in a state of mind where you are closer to envisioning what is actually going on. This also keeps the bulk out of your code, greatly reducing bloatware.
I believe it makes code more maintainable, more flexible and more efficient aaaand I like it.
Thank you to the others for your input on this as well! I found it very encouraging.