Binary-searching XML on disk in a SAX-like way - sensible? Possible?

Posted on 2024-12-04 17:33:02

I find myself needing to search a (potentially) large XML file for items with a specific timestamp, fast enough to do it within a single animation frame.

I've been doing something similar in a recent project, but there the XML was small enough to fit in memory, so I parsed it out into an array of simple objects and binary-searched it. BOOM! super-quick search through 800-odd timestamped items per-frame.

This time around, the XML files might well be large enough to make parsing them out into memory a stupid idea (this is iOS stuff, so RAM is limited). The solution in my head is to do SAX-like stream parsing from a file, but with a settable pointer. So I could jump that pointer around the file in another binary search, parse the next complete node in the file, and use that to inform where the search pointer jumps next.

A good theory, I think. However, looking around the internets, I haven't been able to find a SAX parser that allows setting of its current line number in the file. Many give you read-only access as a status, but none allow that oh-so-crucial position setting.

SO. Does anyone know of an XML parsing lib that has such an ability? Again, this is iOS world, so anything C/C++ based would do, but bonus points if it has an Obj-C wrapper.

Comments (1)

鹿童谣 2024-12-11 17:33:02

You can't do that safely in XML, at least not directly. You said you want to jump to a certain line number, but that might not help you, because XML is not line-based. And you can't easily jump to the n-th child of some node, because that requires parsing everything that comes before it.

What you can do is first parse the whole file and create an index: for each node (of those you need to jump to), you remember its start position in the file (probably as a byte offset). You can do this with a SAX (or SAX-like) parser, so you don't need to have the whole document in memory.

If you do it this way, you have to parse the whole file once (an O(n) operation), but you can then jump to any node and parse it quickly (the jump itself is O(1)), which should make the binary search performant.
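
To make the offset index concrete, here is a rough C++ sketch, not a definitive implementation. It assumes a hypothetical format where each record looks like `<item ts="123">...</item>`, records are not nested, and they appear in ascending `ts` order. The function names are made up for illustration, and the naive tag scanner only stands in for the real SAX pass: it knows nothing about comments, CDATA or entities.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Assumed (hypothetical) record shape: <item ts="123">...</item>, sorted by ts.

// One O(n) pass: record the byte offset of every "<item " start tag.
// A real SAX parser would replace this naive scanner.
std::vector<std::streamoff> buildOffsetIndex(std::istream& in) {
    const std::string tag = "<item ";
    std::vector<std::streamoff> offsets;
    std::size_t matched = 0;   // how many characters of `tag` match so far
    std::streamoff pos = 0;    // number of bytes consumed so far
    char c;
    while (in.get(c)) {
        if (c == tag[matched]) {
            ++matched;
        } else {
            // '<' occurs only at the start of `tag`, so this fallback is enough.
            matched = (c == tag[0]) ? 1 : 0;
        }
        ++pos;
        if (matched == tag.size()) {
            offsets.push_back(pos - static_cast<std::streamoff>(tag.size()));
            matched = 0;
        }
    }
    return offsets;
}

// Seek to an indexed offset and read only that element's ts attribute.
long long readTimestampAt(std::istream& in, std::streamoff offset) {
    in.clear();                      // drop EOF/fail state left by earlier reads
    in.seekg(offset);
    assert(in.peek() == '<');        // index entries must point at a start tag
    std::string head;
    std::getline(in, head, '>');     // the start tag, without the closing '>'
    const std::string key = "ts=\"";
    const std::size_t p = head.find(key);
    if (p == std::string::npos) throw std::runtime_error("no ts attribute");
    return std::stoll(head.substr(p + key.size()));  // stops at the closing quote
}

// Binary search over the offsets; each probe costs one seek and a short read.
// Returns the byte offset of the matching element, or -1 if there is none.
std::streamoff findByTimestamp(std::istream& in,
                               const std::vector<std::streamoff>& offsets,
                               long long target) {
    std::size_t lo = 0, hi = offsets.size();
    while (lo < hi) {
        const std::size_t mid = lo + (hi - lo) / 2;
        const long long ts = readTimestampAt(in, offsets[mid]);
        if (ts == target) return offsets[mid];
        if (ts < target) lo = mid + 1; else hi = mid;
    }
    return -1;
}
```

In real use you would pass a `std::ifstream` opened with `std::ios::binary`, so the recorded offsets match the bytes on disk. If you index with a real parser instead, check whether it reports byte positions; Expat, for example, has `XML_GetCurrentByteIndex`, which can be called from inside a start-element handler.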

Or you could create the index based on the property you want to search. If you do this, the whole binary search will be in memory and you only need to parse the one node (or the few nodes) you actually need, which should be even faster.
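
A sketch of that second variant, again in C++ and again under assumptions: `IndexEntry`, `findEntry` and `readElementAt` are hypothetical names, the index is assumed to have been filled during a single parsing pass and sorted by timestamp, and records are assumed to be non-nested `<item ...>...</item>` elements. The search never touches the file; only the final read does.

```cpp
#include <algorithm>
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// One index entry per record: the search key plus where the record starts.
struct IndexEntry {
    long long ts;            // timestamp taken from the record
    std::streamoff offset;   // byte offset of the record's start tag
};

// Binary search entirely in memory. Returns nullptr when no entry matches.
const IndexEntry* findEntry(const std::vector<IndexEntry>& index, long long target) {
    // Debug-only precondition: the index must be sorted by timestamp.
    assert(std::is_sorted(index.begin(), index.end(),
        [](const IndexEntry& a, const IndexEntry& b) { return a.ts < b.ts; }));
    auto it = std::lower_bound(index.begin(), index.end(), target,
        [](const IndexEntry& e, long long t) { return e.ts < t; });
    if (it == index.end() || it->ts != target) return nullptr;
    return &*it;
}

// One seek, then read the raw text of a single element up to and including
// its closing tag. Assumes non-nested, non-self-closing <item> elements.
std::string readElementAt(std::istream& in, std::streamoff offset) {
    const std::string close = "</item>";
    in.clear();              // drop EOF/fail state left by earlier reads
    in.seekg(offset);
    std::string text;
    char c;
    while (in.get(c)) {
        text.push_back(c);
        if (text.size() >= close.size() &&
            text.compare(text.size() - close.size(), close.size(), close) == 0) {
            break;           // reached the end of this element
        }
    }
    return text;
}
```

The string returned by `readElementAt` is a small, complete fragment, so it can be handed to whatever XML parser you already use. The memory cost is one `IndexEntry` per record, which is typically far smaller than the parsed objects themselves.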
