Handling large XML documents in memory

Posted on 2025-01-04 17:08:12

I need to hold a very large number of XML documents in memory (most probably using Oracle Coherence as a distributed cache). The expectation is to hold 100,000 XML documents in memory. They are quite big, approximately 250 KB each. Other systems request these documents, but each asks only for the part of the XML that is relevant to it. They will also ask to make changes to the content. The load will be about 300 such requests per minute, distributed more or less evenly between retrievals and updates. An important note: the XML documents have no fixed structure, so I won't have an XSD for them, but I do have the algorithm to extract and update them.

My question is which approach will yield better performance: keeping the XML documents in memory as they are and doing all data extraction and updates with XQuery (or even hand-coded procedures), or transforming them into objects, manipulating those in code, and transforming them back to XML when other systems request them?

Comments (2)

我的鱼塘能养鲲 2025-01-11 17:08:12

You have 100,000 documents of 250 KB each. That makes approx. 24 GB of raw data. If you put that in memory and want to be able to process, filter, or update it, you will have an additional blow-up factor of, let's say, 10. Then you end up needing a memory capacity of about 240 GB.
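The sizing arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the factor of 10 is the rough guess stated above, not a measured value, and the constant and function names are made up for illustration.

```python
# Back-of-the-envelope memory sizing for the document cache.
DOC_COUNT = 100_000      # number of XML documents (from the question)
DOC_SIZE_KB = 250        # approximate size of one document (from the question)
OVERHEAD_FACTOR = 10     # assumed blow-up for parsed, queryable data (a guess)


def raw_size_gib(doc_count: int, doc_size_kb: int) -> float:
    """Total size of the serialized documents in GiB (1 GiB = 1024 * 1024 KiB)."""
    return doc_count * doc_size_kb / (1024 * 1024)


raw = raw_size_gib(DOC_COUNT, DOC_SIZE_KB)
working = raw * OVERHEAD_FACTOR
print(f"raw: {raw:.1f} GiB, working set: {working:.0f} GiB")
# prints: raw: 23.8 GiB, working set: 238 GiB
```

In binary units the raw data comes to roughly 23.8 GiB, which matches the "approx. 24 GB" and "240 GB" figures after rounding. The real overhead factor depends on the parser and data model, so it is worth measuring on a sample of actual documents.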

So, if you have enough memory available, that is of course the best place to hold it. But you need a fallback strategy (what happens if the number of nodes outgrows memory?), and it becomes even more complicated if you don't want to lose updates: What happens if the machine fails? If you update in memory, when do you flush updates to disk? And there are even more things to think about.
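One common answer to the "when do you flush?" question is a write-behind policy: updates go to memory first and are written to disk in batches. The sketch below is a minimal single-process illustration, not how Oracle Coherence or any other product implements it; the class name `WriteBehindStore` and the `flush_every` threshold are hypothetical.

```python
import os
import tempfile


class WriteBehindStore:
    """In-memory document store that flushes dirty entries to disk in batches."""

    def __init__(self, directory: str, flush_every: int = 3):
        self.directory = directory
        self.flush_every = flush_every   # flush once this many documents are dirty
        self.docs = {}                   # key -> XML text held in memory
        self.dirty = set()               # keys updated since the last flush

    def put(self, key: str, xml_text: str) -> None:
        self.docs[key] = xml_text
        self.dirty.add(key)
        if len(self.dirty) >= self.flush_every:
            self.flush()

    def flush(self) -> None:
        for key in sorted(self.dirty):
            final_path = os.path.join(self.directory, f"{key}.xml")
            tmp_path = final_path + ".tmp"
            with open(tmp_path, "w", encoding="utf-8") as fh:
                fh.write(self.docs[key])
            # Swap the finished file into place so readers never see a partial write.
            os.replace(tmp_path, final_path)
        self.dirty.clear()


with tempfile.TemporaryDirectory() as directory:
    store = WriteBehindStore(directory, flush_every=2)
    store.put("a", "<a/>")               # held in memory only
    store.put("b", "<b/>")               # second dirty entry triggers a flush
    print(sorted(os.listdir(directory)))
```

The trade-off is visible in the threshold: a larger `flush_every` means fewer disk writes, but more updates are lost if the machine fails before the next flush.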

Yet, to answer your second question: transforming into objects or not? Most people are tempted to transform XML into objects using PHP, Ruby, Java, .NET, or the like, and even to store XML in SQL databases. If you want to hear an honest answer: don't do it unless you have plenty of time and money to waste. Objects introduce a large overhead of additional analysis, design, parsing, marshalling, testing, maintenance ... In fact, this removes the flexibility of XML completely, and I see this constantly underestimated. In my experience, working with XML and XQuery saves you around 80% on average on the things I've listed above.

Also, if you force flexible XML data into objects, you will face a nightmare if your data structures evolve.
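To make the "work on the XML directly" option concrete, here is a minimal sketch that extracts a fragment and updates a value without mapping the document to classes. It uses Python's standard `xml.etree.ElementTree`, which supports only a limited XPath subset, as a stand-in for a real XQuery engine. The sample document and the helper names `extract_fragment` and `update_text` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; real documents would have no fixed schema.
SAMPLE = """<order id="42">
  <customer><name>Ada</name><city>London</city></customer>
  <items>
    <item sku="A1"><qty>2</qty></item>
    <item sku="B7"><qty>1</qty></item>
  </items>
</order>"""


def extract_fragment(doc: str, path: str):
    """Return the first element matching path as XML text, or None if absent."""
    node = ET.fromstring(doc).find(path)
    if node is None:
        return None
    # tostring() also emits the whitespace following the element, so strip it.
    return ET.tostring(node, encoding="unicode").strip()


def update_text(doc: str, path: str, new_text: str) -> str:
    """Set the text of the first element matching path; return the new document."""
    root = ET.fromstring(doc)
    node = root.find(path)
    if node is None:
        raise KeyError(path)
    node.text = new_text
    return ET.tostring(root, encoding="unicode")


print(extract_fragment(SAMPLE, "customer"))
updated = update_text(SAMPLE, "items/item[@sku='B7']/qty", "5")
print(extract_fragment(updated, "items/item[@sku='B7']"))
```

Nothing here depends on the shape of the document beyond the path expression itself, which is the flexibility argument made above. Note that this sketch reparses the text on every call; whether to cache the parsed tree instead is a separate performance decision.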

You might want to check out 28msec's Scalable Database for flexible data, which is a PaaS in the cloud. There you get everything you need out of the box (including load balancing, auto-recovery, persistence management, replication, backups, automatic failover, scaling in and out, elasticity, memory management, sharding, ...).

This is only my personal opinion, but maybe it at least adds a few more aspects to consider for your problem.

未蓝澄海的烟 2025-01-11 17:08:12

My guess is that it will be faster in memory (if you have enough room).
But as with all performance issues, this is caveated with a big "it depends". You need to profile the actual usage.
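As a starting point for such profiling, the sketch below times two ways of serving the same lookup: reparsing the stored XML text on every request versus querying a tree that was parsed once and kept in memory. The synthetic document (roughly 150 KB), the function names, and the query are all made up for illustration; real measurements should use real documents and the real access patterns.

```python
import timeit
import xml.etree.ElementTree as ET


def build_document(item_count: int) -> str:
    """Build a synthetic XML document with item_count small child elements."""
    root = ET.Element("doc")
    for i in range(item_count):
        item = ET.SubElement(root, "item", {"id": str(i)})
        item.text = "x" * 50
    return ET.tostring(root, encoding="unicode")


DOC_TEXT = build_document(2000)          # stored as serialized text
DOC_TREE = ET.fromstring(DOC_TEXT)       # parsed once and kept in memory
TARGET = "item[@id='1999']"              # worst case: the last item


def lookup_from_text() -> str:
    """Parse the text on every request, then query."""
    return ET.fromstring(DOC_TEXT).find(TARGET).text


def lookup_from_tree() -> str:
    """Query the already-parsed tree."""
    return DOC_TREE.find(TARGET).text


RUNS = 50
for fn in (lookup_from_text, lookup_from_tree):
    seconds = timeit.timeit(fn, number=RUNS)
    print(f"{fn.__name__}: {seconds / RUNS * 1e3:.3f} ms per call")
```

The absolute numbers matter less than the ratio between the two and how it changes with document size, since keeping parsed trees trades memory for parse time.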
