Thoughts on the philosophy of minimizing code and maximizing data

Published 2024-10-06 15:42:57


I have heard of the concept of minimizing code and maximizing data, and was wondering what advice other people can give me on how/why I should do this when building my own systems?


走野 2024-10-13 15:42:57


Typically data-driven code is easier to read and maintain. I know I've seen cases where data-driven has been taken to the extreme and winds up very unusable (I'm thinking of some SAP deployments I've used), but coding your own "Domain Specific Languages" to help you build your software is typically a huge time saver.

The pragmatic programmers remain in my mind the most vivid advocates of writing little languages that I have read. Little state machines that run little input languages can get a lot accomplished with very little space, and make it easy to make modifications.
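That idea can be made concrete with a small sketch; here is one plausible shape for a table-driven state machine in Python (the machine, its tiny input language, and all the names are invented for illustration):

```python
# A little table-driven state machine: the transition table is data; the
# driver loop is generic. This one recognizes simple decimal integers
# with an optional leading sign (e.g. "-42").

TRANSITIONS = {
    ("start", "sign"): "signed",
    ("start", "digit"): "number",
    ("signed", "digit"): "number",
    ("number", "digit"): "number",
}

def kind(ch):
    # Classify one input character for the table lookup.
    if ch in "+-":
        return "sign"
    if ch.isdigit():
        return "digit"
    return "other"

def is_integer(text):
    state = "start"
    for ch in text:
        state = TRANSITIONS.get((state, kind(ch)))  # None = no transition
        if state is None:
            return False
    return state == "number"

print(is_integer("-42"))   # True
print(is_integer("4-2"))   # False
```

Changing what the machine accepts means editing the table, not the loop.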

A specific example: consider a progressive income tax system, with tax brackets at $1,000, $10,000, and $100,000 USD. Income below $1,000 is untaxed. Income between $1,000 and $9,999 is taxed at 10%. Income between $10,000 and $99,999 is taxed at 20%. And income above $100,000 is taxed at 30%. If you were to write this all out in code, it'd look about as you suspect:

total_tax_burden(income) {
    if (income < 1000)
        return 0
    if (income < 10000)
        return .1 * (income - 1000)
    if (income < 100000)
        return 900 + .2 * (income - 10000)    // 900 = .1 * (10000 - 1000)
    return 18900 + .3 * (income - 100000)     // 18900 = 900 + .2 * (100000 - 10000)
}

Adding new tax brackets, changing the existing brackets, or changing the tax burden in the brackets, would all require modifying the code and recompiling.

But if it were data-driven, you could store this table in a configuration file:

1000:0
10000:10
100000:20
inf:30

Write a little tool to parse this table and do the lookups (not very difficult, right?) and now anyone can easily maintain the tax rate tables. If Congress decides that 1,000 brackets would be better, anyone could make the tables line up with the IRS tables and be done with it, with no code recompiling necessary. The same generic code could be used for one bracket or hundreds of brackets.
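As a rough sketch of what such a tool might look like, assuming the exact `threshold:rate` format shown above (the parsing and lookup code here is just one plausible implementation, not a prescribed one):

```python
# Parse a bracket table of "threshold:rate_percent" lines and compute the
# progressive tax burden from it. Each line means: income up to `threshold`
# (above the previous threshold) is taxed at `rate_percent`.

def load_brackets(lines):
    brackets = []
    for line in lines:
        threshold, rate = line.strip().split(":")
        brackets.append((float(threshold), float(rate) / 100))
    return brackets  # assumed sorted by threshold, ending with "inf"

def total_tax_burden(income, brackets):
    tax = 0.0
    lower = 0.0
    for threshold, rate in brackets:
        if income < threshold:
            return tax + rate * (income - lower)
        tax += rate * (threshold - lower)
        lower = threshold
    return tax

table = ["1000:0", "10000:10", "100000:20", "inf:30"]
brackets = load_brackets(table)
print(round(total_tax_burden(500, brackets), 2))     # 0.0
print(round(total_tax_burden(5000, brackets), 2))    # 400.0
print(round(total_tax_burden(20000, brackets), 2))   # 2900.0
```

Adding a bracket is now a one-line change to the table; the loop never changes.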

And now for something that is a little less obvious: testing. The AppArmor project has hundreds of tests for what system calls should do when various profiles are loaded. One sample test looks like this:

#! /bin/bash
# $Id$

#   Copyright (C) 2002-2007 Novell/SUSE
#
#   This program is free software; you can redistribute it and/or
#   modify it under the terms of the GNU General Public License as
#   published by the Free Software Foundation, version 2 of the
#   License.

#=NAME open
#=DESCRIPTION 
# Verify that the open syscall is correctly managed for confined profiles.  
#=END

pwd=`dirname $0`
pwd=`cd $pwd ; /bin/pwd`

bin=$pwd

. $bin/prologue.inc

file=$tmpdir/file
okperm=rw
badperm1=r
badperm2=w

# PASS UNCONFINED
runchecktest "OPEN unconfined RW (create) " pass $file

# PASS TEST (the file shouldn't exist, so open should create it)
rm -f ${file}
genprofile $file:$okperm
runchecktest "OPEN RW (create) " pass $file

# PASS TEST
genprofile $file:$okperm
runchecktest "OPEN RW" pass $file

# FAILURE TEST (1)
genprofile $file:$badperm1
runchecktest "OPEN R" fail $file

# FAILURE TEST (2)
genprofile $file:$badperm2
runchecktest "OPEN W" fail $file

# FAILURE TEST (3)
genprofile $file:$badperm1 cap:dac_override
runchecktest "OPEN R+dac_override" fail $file

# FAILURE TEST (4)
# This is testing for bug: https://bugs.wirex.com/show_bug.cgi?id=2885
# When we open O_CREAT|O_RDWR, we are (were?) allowing only write access
# to be required.
rm -f ${file}
genprofile $file:$badperm2
runchecktest "OPEN W (create)" fail $file

It relies on some helper functions to generate and load profiles, test the results of the functions, and report back to users. It is far easier to extend these little test scripts than it is to write this sort of functionality without a little language. Yes, these are shell scripts, but they are so far removed from actual shell scripts ;) that they are practically data.
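The same table-driven idea, stripped of the shell and the AppArmor helper functions, can be sketched in a few lines of Python; the function under test and the test table here are invented purely for illustration:

```python
# A tiny table-driven test runner: each row is (name, input, expected),
# and one generic loop replaces a hand-written test function per case.

def classify_perm(perm):
    # Toy function under test: map a permission string to a category.
    return "read-write" if set(perm) == {"r", "w"} else "restricted"

TESTS = [
    ("rw is read-write", "rw", "read-write"),
    ("r alone is restricted", "r", "restricted"),
    ("w alone is restricted", "w", "restricted"),
]

def run_tests(tests):
    failures = []
    for name, arg, expected in tests:
        got = classify_perm(arg)
        status = "pass" if got == expected else "fail"
        print(f"{status}: {name}")
        if status == "fail":
            failures.append(name)
    return failures

assert run_tests(TESTS) == []
```

Adding a test case is one line of data, just as adding an AppArmor check is one `runchecktest` line.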

I hope this helps motivate data-driven programming; I'm afraid I'm not as eloquent as others who have written about it, and I certainly haven't gotten good at it, but I try.

谎言 2024-10-13 15:42:57


In modern software the line between code and data can become awfully thin and blurry, and it is not always easy to tell the two apart. After all, as far as the computer is concerned, everything is data, unless it is determined by existing code - normally the OS - to be otherwise. Even programs have to be loaded into memory as data, before the CPU can execute them.

For example, imagine an algorithm that computes the cost of an order, where larger orders get lower prices per item. It is part of a larger software system in a store, written in C.

This algorithm is written in C and reads a file that contains an input table provided by the management with the various per-item prices and the corresponding order size thresholds. Most people would argue that a file with a simple input table is, of course, data.

Now, imagine that the store changes its policy to some sort of asymptotic function, rather than pre-selected thresholds, so that it can accommodate insanely large orders. They might also want to factor in exchange rates and inflation - or whatever else the management people come up with.

The store hires a competent programmer and she embeds a nice mathematical expression parser in the original C code. The input file now contains an expression with global variables, functions such as log() and tan(), as well as some simple stuff like the Planck constant and the rate of carbon-14 degradation.

cost = (base * ordered * exchange * ... + ... / ...)^13

Most people would still argue that the expression, even if not as simple as a table, is in fact data. After all it is probably provided as-is by the management.

The store receives a large number of complaints from clients who went brain-dead trying to estimate their expenses, and from the accounting people about the large amount of loose change. The store decides to go back to the table for small orders and use a Fibonacci sequence for larger orders.

The programmer gets tired of modifying and recompiling the C code, so she embeds a Python interpreter instead. The input file now contains a Python function that polls a roomful of Fib(n) monkeys for the cost of large orders.
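The "embedded interpreter" step can be sketched roughly like this; the file contents and the `price` function are hypothetical, and note that `exec` on untrusted input is a real security hazard, so this is purely illustrative:

```python
# Sketch of the embedded-interpreter step: the pricing policy lives in an
# input file as Python source, and the host program just loads and calls it.
# WARNING: exec'ing untrusted input is dangerous; this is only a sketch.

policy_source = """
def price(quantity, base):
    # management's latest policy: a discount kicks in for larger orders
    if quantity < 10:
        return base * quantity
    return base * quantity * 0.9
"""

namespace = {}
exec(policy_source, namespace)      # in real code, read from the input file
price = namespace["price"]

print(price(5, 2.0))    # 10.0
print(price(20, 2.0))
```

The host program never changes again; only the "data" file does.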

Question: Is this input file data?

From a strictly technical point of view, there is nothing different. Both the table and the expression needed to be parsed before usage. The mathematical expression parser probably supported branching and functions - it might not have been Turing-complete, but it still used a language of its own (e.g. MathML).

Yet now many people would argue that the input file just became code.

So what is the distinguishing feature that turns the input format from data into code?

  • Modifiability: Having to recompile the whole system to effect a change is a very good indication of a code-centric system. Yet I can easily imagine (well, more like I have actually seen) software that has been designed incompetently enough to have e.g. an input table built in at compile time. And let's not forget that many applications still have icons - which most people would deem data - built into their executables.

  • Input format: This is, in my (perhaps naive-sounding) opinion, the factor people most commonly consider: "If it is in a programming language, then it is code." Fine, C is code - you have to compile it, after all. I would also agree that Python is code - it is a full-blown language. So why isn't XML/XSL code? XSL is a quite complex language in its own right - hence the L in its name.

In my opinion, neither of these two criteria is the actual distinguishing feature. I think that people should consider something else:

  • Maintainability: In short, if the user of the system has to hire a third party for the expertise needed to modify the system's behaviour, then the system should be considered code-centric to a degree.

This, of course, means that whether a system is data-driven or not should be considered at least in relation to the target audience - if not in relation to the client on a case-by-case basis.

It also means that the distinction can be impacted by the available toolset. The UML specification is a nightmare to go through, but these days we have all those graphical UML editors to help us. If there was some kind of third-party high-level AI tool that parses natural language and produces XML/Python/whatever, then the system becomes data-driven even for far more complex input.

A small store probably does not have the expertise or the resources to hire a third party. So, something that allows the workers to modify its behaviour with the knowledge that one would get in an average management course - mathematics, charts etc - could be considered sufficiently data-driven for this audience.

On the other hand, a multi-billion-dollar international corporation usually has a bunch of IT specialists and Web designers on its payroll. Therefore, XML/XSL, JavaScript, or even Python and PHP are probably easy enough for it to handle. It also has complex enough requirements that something simpler might just not cut it.

I believe that when designing a software system, one should strive to achieve that fine balance in the used input formats where the target audience can do what they need to, without having to frequently call on third parties.

It should be noted that outsourcing blurs the lines even more. There are quite a few problems for which current technology simply does not allow a solution approachable by the layman. In that case the target audience of the solution should probably be considered to be the third party to which the operation would be outsourced. That third party can be expected to employ a fair number of experts.

怪异←思 2024-10-13 15:42:57


One of five maxims under the Unix Philosophy, as presented by Rob Pike, is this:

Data dominates. If you have chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.

It is often shortened to, "write stupid code that uses smart data."

固执像三岁 2024-10-13 15:42:57


Other answers have already dug into how you can often code complex behavior with simple code that just reacts to the pattern of its particular input. You can think of the data as a domain-specific language, and of your code as an interpreter (maybe a trivial one).

Given lots of data you can go further: the statistics can power decisions. Peter Norvig wrote a great chapter illustrating this theme in Beautiful Data, with text, code, and data all available online. (Disclosure: I'm thanked in the acknowledgements.) On pp. 238-239:

How does the data-driven approach compare to a more traditional software development process wherein the programmer codes explicit rules? ... Clearly, the handwritten rules are difficult to develop and maintain. The big advantage of the data-driven method is that so much knowledge is encoded in the data, and new knowledge can be added just by collecting more data. But another advantage is that, while the data can be massive, the code is succinct—about 50 lines for correct, compared to over 1,500 for ht://Dig's spelling code. ...

Another issue is portability. If we wanted a Latvian spelling-corrector, the English metaphone rules would be of little use. To port the data-driven correct algorithm to another language, all we need is a large corpus of Latvian; the code remains unchanged.

He shows this concretely with code in Python, using a dataset collected at Google. Besides spelling correction, there's code to segment words and to decipher cryptograms -- again in just a couple of pages each, where Grady Booch's book spent dozens without even finishing it.
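For a taste of what that chapter's approach looks like, here is a heavily condensed, corpus-driven corrector in the same spirit (the tiny inline corpus stands in for Norvig's real dataset, and this is a paraphrase, not his exact code):

```python
# A condensed corpus-driven spelling corrector: all the "knowledge" is word
# counts from a corpus; the code only generates candidate one-edit variants
# of a word and picks the most frequent known candidate.
from collections import Counter

corpus = "the quick brown fox jumps over the lazy dog the fox".split()
WORDS = Counter(corpus)  # in practice: counts from a large real corpus

def edits1(word):
    # All strings one edit away: deletes, replaces, inserts, transposes.
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    return set(deletes + replaces + inserts + transposes)

def correct(word):
    candidates = ({word} & WORDS.keys()) or (edits1(word) & WORDS.keys()) or {word}
    return max(candidates, key=lambda w: WORDS[w])

print(correct("teh"))   # 'the'
print(correct("foz"))   # 'fox'
```

Porting it to Latvian would mean swapping the corpus; `edits1` and `correct` stay as they are.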

"The Unreasonable Effectiveness of Data" develops the same theme more broadly, without all the nuts and bolts.

I've taken this approach in my work for another search company and I think it's still underexploited compared to table-driven/DSL programming, because most of us weren't swimming in data so much until the last decade or two.

寄与心 2024-10-13 15:42:57


In languages in which code can be treated as data it is a non-issue. You use what's clear, brief, and maintainable, leaning towards data, code, functional, OO, or procedural, as the solution requires.

In procedural code, the distinction is marked, and we tend to think about data as something stored in a specific way; but even in procedural code it is best to hide the data behind an API, or behind an object in OO.

A lookup(avalue) can be reimplemented in many different ways during its lifetime, as long as it starts out as a function.
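As a minimal illustration of that point (the table and both implementations are made up), the interface stays fixed while the storage behind it changes:

```python
# Two interchangeable implementations of lookup(avalue): callers never
# learn whether the answer comes from stored data or from computation.

# v1: backed by a plain dict
_table = {"a": 1, "b": 2}
def lookup(avalue):
    return _table[avalue]

# v2, later in the code's lifetime: computed on demand, same interface
def lookup(avalue):
    return ord(avalue) - ord("a") + 1

print(lookup("b"))  # 2
```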

...All the time I design programs for nonexisting machines and add: 'if we now had a machine comprising the primitives here assumed, then the job is done.'
... In actual practice, of course, this ideal machine will turn out not to exist, so our next task --structurally similar to the original one-- is to program the simulation of the "upper" machine... But this bunch of programs is written for a machine that in all probability will not exist, so our next job will be to simulate it in terms of programs for a next lower level machine, etc., until finally we have a program that can be executed by our hardware...

E. W. Dijkstra in Notes on Structured Programming, 1969, as quoted by John Allen, in Anatomy of Lisp, 1978.

书间行客 2024-10-13 15:42:57


When I think of this philosophy which I agree with quite a bit, the first thing that comes to mind is code efficiency.

When I'm writing code, I know for sure it is rarely anywhere close to perfect, or even fully informed. Knowing enough to get close to maximum efficiency out of a machine when it is needed, and good efficiency the rest of the time (perhaps trading some off for a better workflow), has allowed me to produce high-quality finished products.

Coding in a data-driven way, you end up using code for what code is for. "Outsourcing" every variable to files would be foolishly extreme; the functionality of a program needs to be in the program, while the content, settings, and other factors can be managed by it.

This also allows for much more dynamic applications and new features.

If you have even a simple form of database, you can apply the same functionality to many states. You can also do all manner of creative things, like changing the context of what your program is doing based on file header data, or on the directory, file name, or extension - though not all data is necessarily stored on a filesystem.

Finally, keeping your code in a state where it simply handles data puts you in a frame of mind where you are closer to envisioning what is actually going on. It also keeps the bulk out of your code, greatly reducing bloat.

I believe it makes code more maintainable, more flexible, and more efficient, and I like it.

Thank you to the others for your input on this as well! I found it very encouraging.
