What are some common strategies for refactoring large "state-only" objects?
I am working on a specific soft-real-time decision support system which does online modeling/simulation of the national airspace. This piece of software consumes a number of live data feeds, and produces a once-per-minute estimate of the "state" of a large number of entities in the airspace. The problem breaks down neatly until we hit what is currently the lowest-level entity.
Our mathematical model estimates/predicts upwards of 50 parameters for a timeline of several hours into the past and future for each of these entities, roughly once per minute. Currently, these records are encoded as a single Java class with a lot of fields (some get collapsed into an ArrayList). Our model is evolving, and the dependencies among the fields are not yet set in stone, so each instance wanders through a convoluted model, accumulating settings as it goes along.
Currently we have something like the following, which uses a builder pattern approach to build up the contents of the record and enforce what the known dependencies are (as a check against programmer error as we evolve the model). Once the estimate is done, we convert the below into an immutable form using a .build()-type method.
final class OneMinuteEstimate {
    enum EstimateState { INFANT, HEADER, INDEPENDENT, ... };
    EstimateState state = EstimateState.INFANT;

    // "header" stuff
    DateTime estimatedAtTime = null;
    DateTime stamp = null;
    EntityId id = null;

    // independent fields
    int status1 = -1;
    ...

    // dependent/complex fields...
    // ... goes on for 40+ more fields...

    void setHeaderFields(...) {
        if (!EstimateState.INFANT.equals(state)) {
            throw new IllegalStateException("Must be in INFANT state to set header");
        }
        ...
    }
}
Once a very large number of these estimates are complete, they are assembled into timelines where aggregate patterns/trends are analyzed. We have looked at using an embedded database but have struggled with performance issues; we'd rather get this sorted out in terms of data modeling and then incrementally move portions of the soft-real-time code into an embedded data store.
Once the "time sensitive" pieces of this are done, the products are flushed to flat files and a database.
Problems:
- It's a giant class, with way too many fields.
- There is very little behavior encoded in the class; it's mostly a holder for data fields.
- Maintaining the build() method is extremely cumbersome.
- It feels clumsy to manually maintain a "state machine" abstraction merely for the purpose of ensuring that a large number of dependent modeling components are properly populating a data object, but it has saved us a lot of frustration as the model evolves.
- There is a lot of duplication, particularly when the records described above are aggregated into very similar "rollups" which amount to rolling sums/averages or other statistical products of the above structure in time series.
- While some of the fields could be clumped together, they are all logically "peers" of one another, and any breakdown we've tried has resulted in having behavior/logic artificially split and needing to reach two levels deep in indirection.
Out-of-the-box ideas are entertained, but this is something we need to evolve incrementally. Before anyone else says it, I'll note that one could suggest that our mathematical model is insufficiently crisp if the data representation for that model is this hard to get ahold of. Fair point, and we're working on that, but I think that's a side-effect of an R&D environment with a lot of contributors and a lot of concurrent hypotheses in play.
(Not that it matters, but this is implemented in Java. We use HSQLDB or Postgres for output products. We don't use any persistence framework, partly out of a lack of familiarity, partly because we have enough performance trouble with just the database alone and hand-coded storage routines... we're skeptical of moving towards additional abstraction.)
5 Answers
I had much of the same problem you did.
At least I think I did, sounds like I did. Representation was different, but at 10,000 feet, sounds pretty much the same. Crapload of discrete, "arbitrary" variables and a bunch of ad hoc relationships among them (essentially business driven), subject to change at a moment's notice.
You also have another issue, which you sorta mentioned, and that was the performance requirement. Sounds like faster is better, and likely a slow perfect solution would be tossed out for the fast lousy one, simply because the slower one can't meet a baseline performance requirement, no matter how good it is.
To put it simply, what I did was I designed a simple domain specific rule language for my system.
The entire point of the DSL was to implicitly express relationships and package them up in to modules.
Very crude, contrived example:
First, this is not representative of my syntax.
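The example listing did not survive in this copy of the answer. Consistent with the relationships described in the surrounding text (Rule 1 tests C, C depends on A and B, B depends on A), the rule base might have looked something like this; the syntax is entirely hypothetical:

```text
B = A * 2
C = A + B

RULE 1: IF (C < 10) THEN RAISE ALERT

MODULE Module1 = { RULE 1 }   # implicitly pulls in C, then B, then A
MODULE Module2 = { B }        # implicitly pulls in A only
MODULE Module3 = { RULE 1 }   # reuses C's whole chain "for free"
```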
But you can see from the Modules that it is 3 simple rules.
The key though, is that it's obvious from this that Rule 1 depends on C, which depends on A and B, and B depends on A. Those relationships are implied.
So, for that module, all of those dependencies "come with it". You can see if I generated code for Module 1 it might look something like:
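The generated-code listing is also missing here. A plausible sketch, based on the dependencies described above (B derived from A, C from A and B, a rule testing C against a threshold; the numbers and names are illustrative, not the author's actual output):

```java
// Hypothetical generated code for Module 1: because the module needs
// Rule 1, the generator emits the entire dependency chain behind C.
final class Module1 {
    static boolean rule1(double a) {
        double b = a * 2;   // B depends on A
        double c = a + b;   // C depends on A and B
        return c < 10;      // Rule 1 depends on C
    }
}
```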
Whereas if I created Module 2, all I would get is:
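Module 2's listing is missing as well. If Module 2 only needed B, the generated code would plausibly amount to just B's chain (again a hypothetical sketch):

```java
// Hypothetical generated code for Module 2: only B's dependency
// chain is emitted; C and the rule that uses it are filtered out.
final class Module2 {
    static double b(double a) {
        return a * 2;   // B depends only on A
    }
}
```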
In Module 3 you see the "free" reuse:
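Module 3's listing is missing too. A hypothetical sketch of the "free" reuse: two rules share C, so the generator emits C's chain once and both rules draw on it:

```java
// Hypothetical generated code for Module 3: C is computed once and
// reused by both rules (the second rule is invented for illustration).
final class Module3 {
    static boolean[] rules(double a) {
        double b = a * 2;
        double c = a + b;                 // emitted once, shared below
        return new boolean[] {
            c < 10,                       // Rule 1
            c > 100                       // hypothetical second rule on C
        };
    }
}
```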
So, even though I have one "soup" of rules, the Modules root the base of the dependencies, and thus filter out the stuff it doesn't care about. Grab a module, shake the tree and keep what's left hanging.
My system used the DSL to generate source code, but you can easily have it create a mini runtime interpreter as well.
Simple topological sorting handled the dependency graph for me.
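As a sketch of that step (variable names are hypothetical), a minimal Kahn-style topological sort that orders variables so each is computed only after its dependencies:

```java
import java.util.*;

// Orders variables so each appears after everything it depends on.
// deps maps a variable to the list of variables it depends on.
final class DependencyOrder {
    static List<String> order(Map<String, List<String>> deps) {
        Map<String, Integer> remaining = new HashMap<>();       // unmet-dependency counts
        Map<String, List<String>> dependents = new HashMap<>(); // reverse edges
        for (Map.Entry<String, List<String>> e : deps.entrySet()) {
            remaining.put(e.getKey(), e.getValue().size());
            for (String d : e.getValue()) {
                dependents.computeIfAbsent(d, k -> new ArrayList<>()).add(e.getKey());
                remaining.putIfAbsent(d, 0); // leaf inputs have no dependencies
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : remaining.entrySet())
            if (e.getValue() == 0) ready.add(e.getKey());
        List<String> out = new ArrayList<>();
        while (!ready.isEmpty()) {
            String v = ready.remove();
            out.add(v);
            for (String d : dependents.getOrDefault(v, List.of()))
                if (remaining.merge(d, -1, Integer::sum) == 0) ready.add(d);
        }
        if (out.size() != remaining.size())
            throw new IllegalStateException("cycle in rule dependencies");
        return out;
    }
}
```

For the dependency structure described in this answer (B depends on A; C on A and B; Rule 1 on C), the only valid order is A, B, C, Rule 1.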
So, the nice thing about this is that while there was inevitable duplication in the final, generated logic, at least across modules, there wasn't any duplication in the rule base. What you as a developer/knowledge worker maintain is the rule base.
What is also nice is that you can change an equation, and not worry so much about the side effects. For example, if I change C to C = A / 2, then, suddenly, B drops out completely. But the rule for IF (C < 10) doesn't change at all.
With a few simple tools, you can show the entire dependency graph, you can find orphaned variables (like B), etc.
By generating source code, it's going to run as fast as you want.
In my case, it was interesting to see a rule drop a single variable and see 500 lines of source code vanish from the resulting module. That's 500 lines I didn't have to crawl through by hand and remove during maintenance and development. All I had to do was change a single rule in my rule base and let "magic" happen.
I was even able to do some simple peephole optimization and eliminate variables.
It's not that hard to do. Your rule language can be XML, or a simple expression parser. No reason to go full boat Yacc or ANTLR on it if you don't want to. I'll put a plug in for S-Expressions, no grammar needed, brain dead parsing.
Spreadsheets also make a great input tool, actually. Just be strict on the formatting. Kind of sucks for merging in SVN (so, Don't Do That), but end users love it.
You may well be able to get away with an actual rule based system. My system wasn't dynamic at runtime, and didn't really need sophisticated goal seeking and inference, so I didn't need the overhead of such a system. But if one works for you out of the box, then happy day.
Oh, and for an implementation note, for those who don't believe you can hit the 64K code limit in a Java method, well I can assure you it can be done :).
Splitting a Large Data Object is very similar to Normalizing a Large Relational Table (first and second normal form). Follow the rules to reach at least second normal form and you may have a good decomposition of the original class.
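As a rough illustration, with hypothetical fields from the question's domain: treating (entity, minute) as the composite key, fields determined by the entity alone split off into their own class, much as a 2NF decomposition removes partial dependencies on the key:

```java
import java.time.Instant;

// Fields determined by the entity alone (the "partial dependency"
// in 2NF terms). Field names here are invented for illustration.
final class EntityInfo {
    final String entityId;
    final String aircraftType;   // fixed per entity, not per minute
    EntityInfo(String entityId, String aircraftType) {
        this.entityId = entityId;
        this.aircraftType = aircraftType;
    }
}

// Fields determined by the full (entity, minute) key.
final class MinuteEstimate {
    final EntityInfo entity;     // one shared reference, no duplicated fields
    final Instant minute;
    final double altitudeEstimate;
    MinuteEstimate(EntityInfo entity, Instant minute, double altitudeEstimate) {
        this.entity = entity;
        this.minute = minute;
        this.altitudeEstimate = altitudeEstimate;
    }
}
```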
From experience working also with R&D stuff with soft real-time performance constraints (and sometimes monster fat classes), I would suggest NOT using OR mappers. In such situations, you'll be better off "touching the metal" and working directly with JDBC result sets. This is my suggestion for apps with soft real-time constraints and massive amounts of data items per package. More importantly, if the number of distinct classes (not class instances, but class definitions) that need to be persisted is large, and you also have memory constraints in your specs, you will also want to avoid ORMs like Hibernate.
Going back to your original question:
What you seem to have is a typical problem of 1) mapping multiple data items into an OO model where 2) those data items do not exhibit a good way of grouping or segregation (and any attempt at grouping tends simply not to feel right). Sometimes the domain model does not lend itself to such aggregation, and coming up with an artificial way of doing so typically ends up in compromises that don't satisfy all design requirements and desires.
To make matters worse, an OO model typically requires/expects you to have all the items present in a class as the class's fields. Such a class is typically without behavior, so it is just a struct-like construct, aka a "data envelope" or "data shuttle". But such situations beg the following questions:
Does your application need to read/write all 40, 50+ data items at once, always?
Must all data items always be present?
I do not know the specifics of your problem domain, but in general I've found that we rarely ever need to deal with all data items at once. This is where a relational model shines, because you don't have to query all rows from a table at once. You only pull those you need as projections of the table/view in question.
In a situation where we have a potentially large number of data items, but on average the number of data items being passed down the wire is less than the maximum, you'd be better off using a Properties pattern.
Instead of defining a monster envelope class holding all items :
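The code listing is missing from this copy of the answer; such a monster envelope class would look roughly like this (the field names are hypothetical stand-ins for the ~50 described in the question):

```java
import java.time.Instant;

// Hypothetical monster envelope: one field per data item, all "peers".
final class OneMinuteEstimateEnvelope {
    Instant estimatedAtTime;
    String entityId;
    int status1 = -1;
    double altitudeEstimate;
    double speedEstimate;
    // ... 40+ more fields, every one always syntactically present ...
}
```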
Define a dictionary (based on a map for example):
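This listing is also missing. A minimal sketch, assuming the map-based design the following paragraphs describe (a read-only parent interface providing only getters, a mutable sub-interface for the code that populates estimates; all names are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Read-only view: getters only. In the real design this parent
// interface is kept non-public, per the "Problems" section below.
interface Envelope {
    Object get(String key);
    boolean has(String key);
}

// Mutable view, handed only to code that populates estimates.
interface MutableEnvelope extends Envelope {
    void set(String key, Object value);
}

// Dictionary-backed implementation: fields exist only when set.
final class MapEnvelope implements MutableEnvelope {
    private final Map<String, Object> fields = new HashMap<>();
    public Object get(String key) { return fields.get(key); }
    public boolean has(String key) { return fields.containsKey(key); }
    public void set(String key, Object value) { fields.put(key, value); }
}
```

A factory would hand writers a MutableEnvelope and readers only an Envelope, so the compiler enforces the read-only discipline.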
No need to set up read-only internal flags. All you need to do is downcast your envelope instances as Envelope instances (that only provide getters). Code that expects to read should operate on read-only envelopes, and code that expects to change fields should operate on mutable envelopes. Creation of the actual instances would be compartmentalized in factories.
That is, you use the compiler to enforce things to be read-only (or allow things to be mutable) by establishing some code conventions, rules governing what interfaces to use where and how.
You can layer your code into sections that need to write, separate from code that only needs to read. Once that's done, simple code reviews (or even grep) can identify code that is using the wrong interface.
Problems:
Non-public Parent Interface:
Envelope is not declared as a public interface, to prevent erroneous/malicious code from casting a read-only envelope down to a base envelope and then back to a mutable envelope. The intended flow is from mutable to read-only only - it is not intended to be bi-directional. The problem here is that extension of Envelope is restricted to the package that contains it. Whether that is a problem will depend on the particular domain and intended usage.
Factories:
The problem is that factories can (and most likely will) be very complex. Again, the nature of the beast.
Validation:
Another problem introduced with this approach is that now you have to worry about code that expects field X to be present. Having the original monster envelope class partially frees you from that worry because, at least syntactically, all fields are there...
... whether the fields are set or not, that was another matter that still remains with this new model I'm proposing.
So if you have client code that expects to see field X, the client code has to throw some type of exception if the field is not present (or compute or read a sensible default somehow). In such cases, you will have to:
- Identify patterns of field presence. Clients that expect field X to be present might be grouped separately (layered apart) from clients that expect some other field to be present.
- Associate custom validators (proxies to read-only envelope interfaces) that either throw exceptions or compute default values for missing fields according to some rules (rules provided programmatically, with an interpreter, or with a rules engine).
Lack of Typing:
This might be debatable, but people used to working with static typing might feel uneasy about losing the benefits of static typing by going to a loosely typed map-based approach. The counter-argument to this is that most of the web works on a loose typing approach, even on the Java side (JSTL, EL).
Problems aside, the larger the maximum number of possible fields and the lower the average number of fields present at any given time, the more effective this approach will be with respect to performance. It adds additional code complexity, but that's the nature of the beast.
That complexity doesn't go away; it will be present either in your class model or in your validation code. Serialization and transferring down the wire is much more efficient, though, especially if you expect massive numbers of individual data transfers.
Hope it helps.
Actually this looks like a frequent problem that game developers face, bloated classes holding numerous variables and methods because of a deep inheritance tree etc.
There's this blog post about how and why to select composition over inheritance, maybe it would help.
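A minimal sketch of that idea, with hypothetical component names: the estimate owns small, focused parts rather than inheriting them through a deep class tree:

```java
// Composition over inheritance: the estimate *has* components
// (has-a) instead of *being* a leaf of a deep hierarchy (is-a).
// All names here are invented for illustration.
final class Kinematics {
    double altitude;
    double speed;
}

final class StatusFlags {
    int status1 = -1;
}

final class CompositeEstimate {
    final Kinematics kinematics = new Kinematics();   // has-a
    final StatusFlags status = new StatusFlags();     // has-a
}
```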
One way you may be able to intelligently break up a large data class is to look at patterns of access by client classes. For example, if a set of classes only accesses fields 1-20 and another set of classes only accesses fields 25-30, maybe those groups of fields belong in separate classes.
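A brief sketch of that idea (the groupings are hypothetical): expose narrow read interfaces matching the access patterns actually observed, so each client group depends only on the fields it uses:

```java
// One view per observed access pattern; names are illustrative.
interface HeaderView {
    String entityId();
}

interface KinematicsView {
    double altitude();
}

// The estimate implements both views; header-only clients take a
// HeaderView, kinematics-only clients take a KinematicsView.
final class SplitEstimate implements HeaderView, KinematicsView {
    private final String entityId;
    private final double altitude;

    SplitEstimate(String entityId, double altitude) {
        this.entityId = entityId;
        this.altitude = altitude;
    }
    public String entityId() { return entityId; }
    public double altitude() { return altitude; }
}
```

Once each client group programs against its view, fields that always travel together become obvious candidates for their own class.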