How can I expose a large number of XML documents (~2M) for offline querying (XPath)?
I have just short of 2 million XML documents sitting on 16 GB of file system space. They are all valid and share a single DTD. They are all of roughly equal size (all generated by the same lab information system).
I'm looking for an easy way for a single user to query the whole 2M-document corpus. I'm not looking to expose this to the web or even to multiple LAN users; however, I would like it to be able to expose some query interface to my intranet. I'm flexible on the query language, but I would like to be able to run ad hoc queries. I want it to be at least semi-performant, and I'm willing to dedicate additional disk space as needed to accommodate indexes.
A workable solution has to be deployable on a single quad-core Linux box with 8 GB of RAM; new hardware isn't an option.
I found eXist-db, but it doesn't seem to have much activity and the demo site is down.
Comments (4)
I would try in this order:
My hunch is that Berkeley would be the fastest, but BaseX and Sedna are both network-accessible, and BaseX would be the easiest to start using and querying. Sedna also has a schema-aware storage system, which might be beneficial for the situation you describe. Berkeley's Sleepycat license may be an encumbrance if you have a commercial use, so look at it carefully.
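As a rough illustration of what "network-accessible" can mean in practice, here is a minimal sketch of sending an ad hoc XPath query to a BaseX HTTP server through its REST interface, which accepts a `query` parameter on a database URL. The host, port, database name `labdocs`, the credentials, and the element name `result` are all assumptions for the example, and the REST service has to be running for `run_query` to work.

```python
import base64
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen


def build_query_url(base_url, database, xpath):
    """Build a REST URL that asks the server to evaluate `xpath` on `database`."""
    # urlencode percent-escapes the expression so slashes and brackets survive.
    return f"{base_url.rstrip('/')}/rest/{quote(database)}?{urlencode({'query': xpath})}"


def run_query(base_url, database, xpath, user, password):
    """Send the query with HTTP Basic auth and return the response body as text."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = Request(
        build_query_url(base_url, database, xpath),
        headers={"Authorization": f"Basic {token}"},
    )
    with urlopen(request) as response:
        return response.read().decode("utf-8")


# Hypothetical server address and database name; adjust to your installation.
url = build_query_url("http://localhost:8984", "labdocs", "//result")
print(url)
```

Only `build_query_url` is exercised here; calling `run_query` needs a live server and valid credentials.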
My preference is to create an inverted index using a full-text search engine. Below are my preferences. I suggest you spend some time researching these three.
Why full-text search engines?
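To make the inverted-index idea concrete, here is a toy sketch in plain Python. It is not how any particular search engine is implemented; it only shows the principle of mapping (element name, token) pairs to the documents that contain them, so a query touches the index instead of re-parsing every file. The element names `report`, `analyte`, and `result` and the file names are made up for the example.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map (element tag, token) to the set of document ids containing it.

    `docs` is a mapping of document id to XML text.
    """
    index = defaultdict(set)
    for doc_id, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        for elem in root.iter():
            for token in tokenize(elem.text or ""):
                index[(elem.tag, token)].add(doc_id)
    return index


def search(index, tag, query):
    """Return ids of documents where every query token occurs in some <tag> element."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    postings = [index.get((tag, token), set()) for token in tokens]
    return set.intersection(*postings)


# Hypothetical sample documents standing in for the lab reports.
docs = {
    "a.xml": "<report><analyte>Serum glucose</analyte><result>5.4</result></report>",
    "b.xml": "<report><analyte>Serum sodium</analyte><result>140</result></report>",
}
index = build_index(docs)
print(search(index, "analyte", "glucose"))
```

A real engine adds on-disk storage, ranking, and incremental updates, but the lookup structure is the same in spirit, which is why index size grows with the corpus while query time stays small.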
You definitely want an XML database. I would say the emerging leaders are MarkLogic for a commercial product and eXist for open source. Others might have other views. Getting to grips with a new database product always involves a steep learning curve (and the more capable the database, the more there is to learn). But eXist can certainly hack it; don't give up at the first hurdle.
I agree with Michael Kay. Use eXist-db if you want open source and MarkLogic if you want commercial. I did a project for the US Library of Congress NDIIPP program, and after an extensive ATAM analysis we selected eXist as superior to the other systems because of its active user community and widespread use. If you have doubts, just do a search on MarkMail. I think you will find that eXist has more active discussion than any other system.
About 350 pages of the report are online here:
http://www.mnhs.org/preserve/records/legislativerecords/pilot.htm