
Best way to store a large searchable text file

Posted on 2024-12-05 01:40:12

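I'm developing an online Bible search program. The Bible is a fairly large book; as plain text it takes up nearly 5 MB. I plan to implement an API in my program so that other websites can include their own Bible search widgets and programs without having to develop the search queries or store the Bible on their own servers.

With that in mind, I expect to eventually have a moderate flow of queries going through the program. Also, for those unfamiliar with the Bible, its text has two kinds of formatting: it can contain red text and italics. I need a way to store the Bible together with the red-letter and italic formatting while allowing search queries to ignore the formatting.

It also needs to be as fast and efficient as possible (in memory and CPU usage). Any storage format will be considered (MySQL, JSON or XML text files, etc.) as long as queries can ignore the formatting. File size and the number of files don't really matter, so splitting the text into separate files per book or even per chapter is fine with me.

The more important thing to keep in mind, though, is that I want some form of search that can span multiple verses. So a search for "but have everlasting life. For God sent not his Son" would return John 3:16,17. Thanks for any ideas!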


等风也等你 2024-12-12 01:40:12


There are a number of open source document search engines built for precisely what you're trying to do: Solr, Elasticsearch, Xapian, Whoosh and others, plus Haystack, which is a search layer for Django that sits on top of several of them. Other posts on S.O. and elsewhere go into the benefits of using one versus another, but your requirements are simple enough that any of them will be more than fine (and will scale with minimal effort should your project take off, which is always nice to know). So look at their examples and see which one looks most intuitive to you. Solr is arguably the most popular and it's the only one I've worked with, but Elasticsearch uses the same popular Lucene backend and is apparently much easier to get up and running, so I would start there.

As for the actual implementation, you'll want to index each verse as a separate "document" if a single verse (or just its verse number) is what you want to return. The search engine handles ranking the results by relevance (usually with a TF-IDF-style algorithm, in case you're interested).
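To make the "one verse = one document" idea concrete, here is a minimal pure-Python sketch of TF-IDF-style ranking over a handful of verses. It is only an illustration: the sample data, the function names (`tokenize`, `build_index`, `search`) and the exact weighting formula are assumptions of mine, not what Solr or Elasticsearch actually run internally.

```python
import math
import re
from collections import Counter

# Sample data (KJV): each verse is one "document", keyed by its reference.
VERSES = {
    "Genesis 1:1": "In the beginning God created the heaven and the earth.",
    "John 3:16": "For God so loved the world, that he gave his only begotten Son, "
                 "that whosoever believeth in him should not perish, "
                 "but have everlasting life.",
    "John 3:17": "For God sent not his Son into the world to condemn the world; "
                 "but that the world through him might be saved.",
}

def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())

def build_index(verses):
    """Return per-verse term counts and the document frequency of each term."""
    term_counts = {ref: Counter(tokenize(text)) for ref, text in verses.items()}
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())  # each verse counts once per term
    return term_counts, doc_freq

def search(query, term_counts, doc_freq):
    """Rank verse references by a simple TF-IDF score (toy formula)."""
    n_docs = len(term_counts)
    scores = {}
    for ref, counts in term_counts.items():
        score = 0.0
        for term in tokenize(query):
            if counts[term]:
                # Rare terms get a higher weight than common ones.
                idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
                score += counts[term] * idf
        if score > 0:
            scores[ref] = score
    return sorted(scores, key=scores.get, reverse=True)

if __name__ == "__main__":
    term_counts, doc_freq = build_index(VERSES)
    print(search("everlasting life", term_counts, doc_freq))
    print(search("condemn the world", term_counts, doc_freq))
```

A real engine adds stemming, stop words, length normalization and an inverted index on disk, which is most of the reason to use one rather than roll your own.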

The way I'd handle the italics and red text is to include some kind of markup in the text (e.g. wrap the phrase in single asterisks for italics, double asterisks for red) and then tell the analyzer to ignore those characters. There may be a simpler way in the framework you end up choosing, though, so take that with a grain of salt. The requirement for queries that span multiple verses is more complicated, but the answer will probably involve indexing each whole chapter as a document instead of (or maybe in addition to? I'd have to think about it more) each verse.
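Both ideas in that paragraph can be sketched without any search engine at all. The snippet below assumes the asterisk markup described above, strips it before matching, and joins a chapter's verses into one searchable string while remembering where each verse starts, so a phrase that crosses a verse boundary can be mapped back to verse numbers. All names here (`CHAPTER`, `normalize`, `build_chapter_index`, `find_phrase`) are hypothetical, and the placement of the red-letter markers in the sample data is illustrative only.

```python
import re
from bisect import bisect_right

# Hypothetical stored form: *...* marks italics, **...** marks red text.
# Sample data: John 3:16-17 (KJV); the marker placement is illustrative.
CHAPTER = {
    16: "**For God so loved the world, that he gave his only begotten Son, "
        "that whosoever believeth in him should not perish, "
        "but have everlasting life.**",
    17: "**For God sent not his Son into the world to condemn the world; "
        "but that the world through him might be saved.**",
}

def normalize(text):
    """Drop formatting markers, case and punctuation; collapse whitespace."""
    text = text.replace("*", "")
    return " ".join(re.findall(r"[a-z0-9']+", text.lower()))

def build_chapter_index(verses):
    """Join the normalized verses and record each verse's start offset."""
    parts, starts, numbers = [], [], []
    pos = 0
    for number in sorted(verses):
        plain = normalize(verses[number])
        starts.append(pos)
        numbers.append(number)
        parts.append(plain)
        pos += len(plain) + 1  # +1 for the space that joins two verses
    return " ".join(parts), starts, numbers

def find_phrase(phrase, index):
    """Return the verse numbers covered by the first match of the phrase."""
    text, starts, numbers = index
    needle = normalize(phrase)
    if not needle:
        return []
    # Pad with spaces so the phrase only matches on whole-word boundaries.
    hit = (" " + text + " ").find(" " + needle + " ")
    if hit < 0:
        return []
    first = bisect_right(starts, hit) - 1
    last = bisect_right(starts, hit + len(needle) - 1) - 1
    return numbers[first:last + 1]

if __name__ == "__main__":
    index = build_chapter_index(CHAPTER)
    print(find_phrase("but have everlasting life. For God sent not his Son", index))
```

This only reports the first match and does exact phrase matching, so it is closer to a chapter-level "document" with a verse offset table than to a ranked search; in a real engine the same offset table could be stored alongside each chapter document.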

A word of caution: if you're not familiar with search indexing, even something designed to be plug-and-play like Elasticsearch will probably still require some time and effort to set up. So if you absolutely need to get this up and running quickly and you're already familiar with MySQL, I suppose it could work (it does do full-text search). But it's certainly not the best tool for the job, so if this is a project you're invested in, you will thank yourself later for putting in a little work to learn one of these search frameworks. A search framework may be overkill for the amount of text you're dealing with, as others have pointed out, but it will be extremely flexible in how you can search that text, which seems to be what you want. Adding other requirements later on would also be very straightforward (for instance, you could let people limit their search to matches in the red text).
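As a rough illustration of that last point, one way to support "red text only" searches is to derive a second field from the same markup at index time. This is a sketch under the asterisk convention suggested earlier; `to_fields`, `words`, `matches` and the field names are hypothetical, and in a real engine the two fields would simply be indexed separately and selected in the query.

```python
import re

# Assumed markup: **...** wraps red text, single * wraps italics.
# Caveat: italics ending exactly where red text ends ("***") is ambiguous here.
RED_SPAN = re.compile(r"\*\*(.+?)\*\*", re.DOTALL)

def to_fields(marked_up):
    """Split one stored verse into a full-text field and a red-only field."""
    red_parts = RED_SPAN.findall(marked_up)
    return {
        "text": marked_up.replace("*", ""),
        "red_text": " ".join(part.replace("*", "") for part in red_parts),
    }

def words(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())

def matches(query, fields, red_only=False):
    """True if the query words occur consecutively in the chosen field."""
    haystack = words(fields["red_text" if red_only else "text"])
    needle = words(query)
    n = len(needle)
    return n > 0 and any(
        haystack[i:i + n] == needle for i in range(len(haystack) - n + 1)
    )

if __name__ == "__main__":
    # John 3:3 (KJV); only the quoted speech is marked as red.
    verse = to_fields(
        "Jesus answered and said unto him, **Verily, verily, I say unto thee, "
        "Except a man be born again, he cannot see the kingdom of God.**"
    )
    print(matches("born again", verse, red_only=True))
    print(matches("Jesus answered", verse, red_only=True))
```

Keeping the markup in the stored text and deriving fields from it means the display format and the searchable fields never drift apart.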
