Is there ever a good reason to hold data internally as XML?
In the years that I've been at my place of employment, I've noticed a distinct trend towards something that I consider an anti-pattern: Maintaining internal data as big strings of XML. I've seen this done a number of different ways, though the two worst offenders were quite similar.
The Webservice
The first application, a web service, provides access to a potentially high volume of data within a SQL database. At startup, it pulls more-or-less all of that data out of the database and stores it in memory as XML. (Three times.) The owners of this application call it a cache. I call it slow, because every perf problem that's been run into while working against this has been directly traceable to this thing. (It being a corporate environment, there should be no surprise that the client gets blamed for the perf failure, not the service.) This application does use the XML DOM.
The Importer
The second application reads an XML file that was generated as the result of an export from a third-party database. The goal is to import this data into a proprietary system (owned by us). The application that does it reads the entire XML file in and maintains at least two, sometimes as many as four, copies of the XML file throughout the entire importing sequence. Note that the data can be manipulated and transformed, and configuration can occur, before the import takes place, so the importer owns this data in an XML format for its entire lifetime. Unsurprisingly, this importer then explodes when a moderately sized XML file is provided. This application only uses the XML DOM for one of its copies; the rest are all raw XML strings.
My understanding of common sense suggests that XML is not a good format for holding data in-memory, but rather data should be translated into XML when it's being output/transferred and translated into internal data structures when being read in and imported. The thing is, I'm constantly running into production code that completely ignores the scalability issues, and goes through a ton of extra effort to do so. (The sheer volume of string parsing in these applications is frightening.)
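To make the boundary idea concrete, here is a minimal Python sketch (neither application's language is stated, so Python is only a stand-in). The `Customer` type, the element names, and both helper functions are hypothetical. XML is parsed once on the way in, the work happens on plain objects, and XML is produced again only on the way out.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Customer:
    # Hypothetical record type; the field names are illustrative only.
    id: int
    name: str


def parse_customers(xml_text: str) -> list[Customer]:
    """Convert XML into plain objects once, at the input boundary."""
    root = ET.fromstring(xml_text)
    return [
        Customer(id=int(el.get("id")), name=el.findtext("name", default=""))
        for el in root.iter("customer")
    ]


def serialize_customers(customers: list[Customer]) -> str:
    """Convert objects back to XML only when the data leaves the process."""
    root = ET.Element("customers")
    for c in customers:
        el = ET.SubElement(root, "customer", id=str(c.id))
        ET.SubElement(el, "name").text = c.name
    return ET.tostring(root, encoding="unicode")


incoming = '<customers><customer id="1"><name>Ada</name></customer></customers>'
customers = parse_customers(incoming)
customers[0].name = customers[0].name.upper()  # work on objects, not strings
outgoing = serialize_customers(customers)
```

Everything between `parse_customers` and `serialize_customers` touches typed fields, so no string parsing is needed in the middle of the program.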
Is this a common failure to apply the right tool for the job that other people run into as well? Or is it just bad luck on my part? Or am I missing some blindingly obvious and good situations where it's Right and OK to store high volumes of data in-memory as XML?
Comments (9)
Any data stored in memory should be in classes. The higher the volume of data we are talking about, the more important this becomes. XML is a hugely bloated format that reduces performance. XML should be used only for transferring data between applications. IMHO.
No, I agree. For your first example, the database should handle almost all the caching, so storing all the data in program memory is wrong. This applies whether it's stored in-memory as XML or otherwise.
For the second, you should convert the XML into a useful representation as soon as possible, probably a database, then work with it that way. Only if it's a small amount of data would it be appropriate to do all the work in-memory as an XmlDocument (e.g. using XPath). String parsing should be used very sparingly.
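As a rough sketch of "convert as soon as possible", the Python below streams records out of an XML export straight into a database. The `<record>` element, the `records` table, and the `import_records` function are all hypothetical, and SQLite stands in for whatever database the real system uses.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET


def import_records(xml_source, conn: sqlite3.Connection) -> int:
    """Stream <record> elements into SQLite instead of keeping XML copies."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records (id INTEGER PRIMARY KEY, name TEXT)"
    )
    count = 0
    # iterparse yields each element as soon as its closing tag has been read.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "record":
            conn.execute(
                "INSERT INTO records (id, name) VALUES (?, ?)",
                (int(elem.get("id")), elem.findtext("name")),
            )
            count += 1
            # Free the record's children and text once stored. The emptied
            # element itself stays attached to the root.
            elem.clear()
    conn.commit()
    return count


sample = io.BytesIO(
    b"<export>"
    b"<record id='1'><name>alpha</name></record>"
    b"<record id='2'><name>beta</name></record>"
    b"</export>"
)
conn = sqlite3.connect(":memory:")
imported = import_records(sample, conn)
rows = conn.execute("SELECT id, name FROM records ORDER BY id").fetchall()
```

After the import, any manipulation or transformation can run as queries against the table rather than as parsing over raw XML strings.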
@Matthew Flaschen makes a great point. I would like to add that when you join any existing project, you are likely to find some design and implementation decisions that you disagree with.
We all learn new things all the time and we all make mistakes. Though I agree that this seems like a "duh" kind of problem, I'm sure the other developers were trying to optimize the code through the concept of a cache.
The point is, sometimes it takes a gentle approach to convince people, especially developers, to change their ways. This isn't a coding problem, but a people problem. You need to find a way to convince these developers that the changes you are suggesting don't imply they are incompetent.
I'd suggest agreeing with them that caching can be a great idea, but that you'd like to work on it to speed up the functions. Create a quick demo of how your (way more logical) implementation works compared with the old way. It's hard to argue with dramatic speed improvements. Just be careful about directly attacking their implementation in conversation. You need these people to work with you.
Good luck!
I agree as well, and I do think there is an element of bad luck.
...but grasping at straws, the only use I could see for data being stored as XML is for automated unit tests, where XML provides an easy way to mock up test data. Definitely not worth it, though.
I've found that I've had to do it to interact with a legacy COM object. The COM object could take either XML or a class. The interop overhead to fill each member of the class was way too large, and processing XML was a much faster alternative. We could have made a C# class identical to the COM class, but it was really too difficult to do in our timeframe. So XML it was. Not that it would ever be a good design decision, but when dealing with interop for huge data structures, it was the fastest we could do.
I do have to say that we are using LINQ to XML on the C# side, so it makes it slightly easier to work with.
What about OOP and databases? XML has its uses, but there can be issues (as you are seeing) with using it for everything.
Databases allow for indexing, transactions, etc., which will speed up your data access.
Objects are in most cases easier to work with. They give a better picture of your domain, etc.
I am not against using XML, but it is like patterns: they are tools, and we should understand where and when to use them, not fall in love with them and try to use them everywhere...
Greg,
in several applications I did follow more or less exactly the pattern you describe:
Edit: no, scratch that. I never stored the XML as a string (or multiple strings). I just parsed it into a DOM and worked with that. THAT was helpful.
I've imported XML sources into the DOM (Microsoft Parser) and kept them there for all the required processing. I'm well aware of the memory overhead the DOM causes, but I found the approach quite useful nonetheless.
Some checks during processing need random access to the data. The selectPath statement works quite well for this purpose.
DOM nodes can be handed back and forth in the application as arguments. The alternative is writing classes wrapping every single type of object, and updating them as the XML schema evolves. It's a poor (VB6/VBA) man's approach to polymorphism.
Applying an XSLT transformation to all or parts of the DOM is a snap
File I/O is taken care of by the DOM too (xmldoc.save...)
A linked list of objects would consume a comparable amount of memory and require more code. All the search and I/O functionality I would have to code myself.
What I've perceived as the anti-pattern is actually an older version of the application, where the XML was parsed more or less manually into arrays of structures.
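The points above about random access and passing nodes around can be illustrated with a small Python sketch. This uses Python's `xml.etree.ElementTree`, which supports only a limited XPath subset, as a stand-in for the Microsoft parser mentioned above. The `<orders>` document and the `count_items` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Parse once and keep the tree; no raw XML strings are retained.
doc = ET.fromstring(
    "<orders>"
    "<order id='1' status='open'><item sku='A'/><item sku='B'/></order>"
    "<order id='2' status='closed'><item sku='C'/></order>"
    "</orders>"
)

# Random access through path expressions instead of hand-written search code.
open_orders = doc.findall("./order[@status='open']")
skus = [item.get("sku") for item in doc.findall(".//item")]


def count_items(order: ET.Element) -> int:
    """Nodes can be passed around as arguments without wrapper classes."""
    return len(order.findall("item"))


# Serialization is handled by the library as well.
saved = ET.tostring(doc, encoding="unicode")
```

The trade-off is the one described above: the tree costs memory, but searching and I/O come for free from the library.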
For high volumes of data the answer is no, there aren't good reasons to store data directly as XML strings in memory.
However, here is an interesting presentation by Alex Brown on how to preserve XML in memory in a more efficient way, as a 'Frozen Stream'.
There is also a video of this, and other presentations given at XML Prague 2009 here.
In general, I would try to use an internal data model that is independent of its serialization in XML.
However, in my opinion there is one case where using XML as an internal data structure makes sense: If your data model needs to capture hierarchical relationships whose format can be extended by 3rd parties and if your application needs to forward this data while preserving the extended information.
Take, for example, the lumberjack logging framework: The idea is to have an XML-based event data model in which every application can provide hierarchical information about events (warnings, errors, etc.). The framework takes care of gathering the events and distributing them to the appropriate handlers. A 3rd party can easily define its own additions to the format, and provide appropriate generators and handlers.
The important part here is that the framework has to forward the XML with all the XML information intact from the generator to a handler. In this case implementing an internal data structure which captures all the necessary information results in a re-implementation of most of XML itself. Hence, using an appropriate DOM framework for internal data representation makes sense.
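A minimal Python sketch of that forwarding case follows. It is not the lumberjack framework's actual API. The `EventRouter` class, the `severity` attribute, and the `<vendor-detail>` extension element are all hypothetical. The router reads only the one attribute it needs for routing and hands the parsed element to handlers untouched, so third-party additions survive.

```python
import xml.etree.ElementTree as ET
from typing import Callable

Handler = Callable[[ET.Element], None]


class EventRouter:
    """Routes events by one known attribute and forwards the element intact."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = {}

    def subscribe(self, severity: str, handler: Handler) -> None:
        self._handlers.setdefault(severity, []).append(handler)

    def publish(self, xml_text: str) -> None:
        event = ET.fromstring(xml_text)
        # Only 'severity' is interpreted here; every other attribute and
        # child element is passed through without being modelled.
        for handler in self._handlers.get(event.get("severity", ""), []):
            handler(event)


received = []
router = EventRouter()
router.subscribe("error", received.append)
router.publish(
    '<event severity="error"><message>disk full</message>'
    '<vendor-detail volume="/dev/sda1"/></event>'
)

# A third-party handler can still read its own extension element.
detail = received[0].find("vendor-detail")
```

Mapping each event onto a fixed class would either drop `<vendor-detail>` or require a generic tree of names, attributes, and children, which is the DOM again.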