How can I expose a large number of XML documents (~2M) for offline querying (XPath)?

Posted 2025-01-02 04:54:20

I have just under 2 million XML documents occupying 16 GB of file system space. They are all valid and share a single DTD. They are all of roughly equal size (all generated by the same lab information system).

I'm looking for an easy way for a single user to query the whole 2M-document corpus. I'm not looking to expose this to the web or even to multiple LAN users; however, I would like it to be able to expose some query interface to my intranet. I'm flexible on the query language, but I would like to be able to run ad hoc queries. I want it to be at least semi-performant, and I'm willing to dedicate additional disk space as needed to accommodate indexes.

A workable solution has to be deployable on a single quad-core Linux box with 8 GB of RAM; new hardware isn't an option.

I found eXist-db, but it doesn't seem to have much activity, and the demo site is down.

Comments (4)

对不⑦ 2025-01-09 04:54:20

I would try these in this order:

  1. BaseX (has a nice GUI; the most promising open source XML database I've found; BSD license)
  2. Sedna (my favorite before BaseX; Apache 2.0 license)
  3. Berkeley DB-XML (an embedded flat-file database; Sleepycat license)
  4. eXist (eXist has always been a hacky disaster; GNU LGPL license)

My hunch is that Berkeley would be the fastest, but BaseX and Sedna are both network-accessible, and BaseX would be the easiest to start using and querying. Sedna also has a schema-aware storage system, which might be beneficial for the situation you describe. Berkeley's Sleepycat license may be an encumbrance if you have a commercial use, so look at it carefully.

那片花海 2025-01-09 04:54:20

My preference is to create an inverted index using a full-text search engine. My preferences are listed below; I suggest you spend time researching these three.

  1. Solr (web interface for querying, easy to get started)
  2. Elasticsearch (distributed, easy to get started)
  3. Raw Lucene (1 and 2 use Lucene behind the scenes)

Why full-text search engines?

  1. Faster
  2. Highlighting
  3. Faceting
  4. Allows free-form search (with XML databases you will be working against XPath, XQuery, or something similar)
  5. Proven to search quickly even with a huge set of files
  6. File-based

一百个冬季 2025-01-09 04:54:20

You definitely want an XML database. I would say the emerging leaders are MarkLogic for a commercial product and eXist for open source. Others might have other views. Getting to grips with a new database product always involves a steep learning curve (and the more capable the database, the more there is to learn). But eXist can certainly hack it; don't give up at the first hurdle.

朮生 2025-01-09 04:54:20

I agree with Michael Kay. Use eXist-db if you want open source and MarkLogic if you want commercial. I did a project for the US Library of Congress NDIIPP program, and after an extensive ATAM analysis we selected eXist as superior to the other systems due to its active user community and widespread use. If you have doubts, just do a search on MarkMail. I think you will find that eXist has a more active discussion than any other system.

About 350 pages of the report are available online here:

http://www.mnhs.org/preserve/records/legislativerecords/pilot.htm
