Best way to store a large searchable text file

Posted on 2024-12-05 01:40:12

I am developing an online Bible search program. The Bible is a pretty large book, taking up nearly 5MB of space in plain text. I am also planning to implement an API in the program, allowing other websites to include their own Bible search widgets and programs without having to develop the search queries or store Bibles on their own servers.

With this in mind, I am going to expect that eventually I will have a moderate flow of queries passing through the program. Also, for those not familiar with the Bible, it has 2 methods of formatting the text. It can contain both red text and italics. I need a way to store the Scriptures along with the red letter and italics formatting but allowing the search queries to ignore the formatting.

It also needs to be as fast and efficient (memory and CPU usage) as possible. Any storage format will be considered (MySQL, JSON or XML text files, etc.) as long as the querying can be done ignoring the formatting. File size and count don't really matter, so splitting up the books or even chapters into separate files is fine by me.

One more important thing to keep in mind though, is that I want to have some form of search method that can search across multiple verses. So a search for "but have everlasting life for God sent not his Son" would return John 3:16,17. Thanks for all ideas!

Comments (2)

等风也等你 2024-12-12 01:40:12

There are a bunch of different open source document search engines which are made for precisely what you're trying to do: Solr, Elasticsearch, Xapian, Whoosh, Haystack (made for Django) and others. There are other posts on Stack Overflow and elsewhere that go into the benefits of using one vs. another, but your requirements are simple enough that any of them will be more than fine (and will scale with very minimal effort should your project take off, which is always nice to know). So look at their examples and see which one looks most intuitive to you - Solr is arguably the most popular and it's the only one I've worked with, but Elasticsearch uses the same popular Lucene backend and is apparently much easier to get up and running, so I would start there.

As for the actual implementation, you'll want to index each verse as a separate "document" if the single verse (or just verse number) is what you want to return. The search engine handles the ranking of the results based on relevancy (usually using a tf/idf algorithm, in case you're interested).
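To make the "one verse per document" idea concrete, here is a minimal pure-Python sketch of TF-IDF style ranking over verse documents. It is only an illustration of the concept, not how Solr or Elasticsearch are implemented; the function names (`tokenize`, `build_index`, `search`) and the three sample verses are assumptions made up for this example, and real engines add length normalization, stemming and much more.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def build_index(verses):
    """verses: dict of reference -> text. Returns (term counts per verse, idf per term)."""
    counts = {ref: Counter(tokenize(text)) for ref, text in verses.items()}
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())  # each verse counts once per term
    n = len(verses)
    idf = {term: math.log(n / df) + 1.0 for term, df in doc_freq.items()}
    return counts, idf

def search(query, counts, idf):
    """Return verse references ordered by a simple tf * idf score, best first."""
    terms = tokenize(query)
    scores = {}
    for ref, term_counts in counts.items():
        score = sum(term_counts[t] * idf.get(t, 0.0) for t in terms)
        if score > 0:
            scores[ref] = score
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical sample data: each verse is its own "document".
verses = {
    "John 3:16": "For God so loved the world, that he gave his only begotten Son, "
                 "that whosoever believeth in him should not perish, "
                 "but have everlasting life.",
    "John 3:17": "For God sent not his Son into the world to condemn the world; "
                 "but that the world through him might be saved.",
    "Genesis 1:1": "In the beginning God created the heaven and the earth.",
}
counts, idf = build_index(verses)
print(search("everlasting life", counts, idf))
```

Words that appear in every verse (such as "God" here) get the lowest weight, so rarer query words dominate the ranking.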

The way I'd handle the italics and red text is to include some kind of markup in the text (e.g. wrap the phrase in single asterisks for italics, double asterisks for red) and then tell the analyzer to ignore those characters - there may be a simpler way in the framework you end up choosing, though, so take that with a grain of salt. The requirement for queries that span multiple verses is more complicated, but the answer will probably involve indexing each whole chapter as a document instead of (or maybe in addition to? I'd have to think about it more) each verse.
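As a rough sketch of the markup idea, the stored text keeps the asterisks for display while a normalized copy is what gets indexed and searched. The asterisk convention is the one proposed above, not a standard, and the helper names `strip_markup` and `normalize` are invented for this example; in a real search engine the equivalent step would live in the analyzer configuration.

```python
import re

# Assumed convention from the answer: *...* is italics, **...** is red-letter text.
MARKUP = re.compile(r"\*+")

def strip_markup(verse):
    """Remove the asterisk markup so formatting never affects matching."""
    return MARKUP.sub("", verse)

def normalize(text):
    """Strip markup, lowercase, drop punctuation and collapse whitespace."""
    text = strip_markup(text).lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return " ".join(text.split())

# Hypothetical stored form: the markup is kept for display only.
stored = "**For God *so* loved the world,** that he gave his only begotten Son"
searchable = normalize(stored)
print(searchable)
```

Applying the same `normalize` function to the user's query means a search matches whether or not the phrase crosses a formatting boundary.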

A word of caution - if you're not familiar with search indexing, even something designed to be plug-and-play like Elasticsearch will probably still require some time and effort to set up, so if you absolutely need to get this up and running quickly and you're already familiar with MySQL, I suppose it could work (it does do fulltext search). But it's certainly not the best tool for the job, so if this is a project that you're invested in, you will thank yourself later if you put in a little bit of work to learn one of these search frameworks. It may be overkill in terms of the amount of text you're dealing with, as others have pointed out, but it will be extremely flexible in how you can search on that text, which seems to be what you want. Adding other requirements later on would also be very straightforward (for instance, you could let people limit their search to only matches in the red text).

穿透光 2024-12-12 01:40:12

I didn't know the Bible had formatting. What is it used for? If it is for the verses, I'd suggest you store every verse in a database. In a highly normalized form, you would have a table with books, a table with chapters and a table with verses. Each verse consists of a verse number and a verse text.

Now, I think the chapters don't have titles, so they are actually just a number as well. In that case it is silly to store them separately, so you have just your table of books and a table of verses, in which each verse has a chapter number, a verse number and a verse text. I assume that text is plain text, isn't it?

If the verse is plain text, you can easily make it searchable by storing it in MySQL and creating a FULLTEXT index for it. That way, you can search quite efficiently and even use wildcards and such.

If the verse were to have formatting, you could choose to create two columns, one with the plain text for searching and one with the formatted text for display, but I doubt you would need this.
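The two-table, two-column layout could look roughly like the sketch below. To keep it runnable without a server it uses Python's built-in `sqlite3` module and a plain `LIKE` match as a stand-in for MySQL and its FULLTEXT index, so treat it as a picture of the schema only. The table names, column names and the `find` helper are assumptions, and the `**` markup in `formatted_text` is purely illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE books (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE verses (
    book_id        INTEGER NOT NULL REFERENCES books(id),
    chapter        INTEGER NOT NULL,
    verse          INTEGER NOT NULL,
    plain_text     TEXT NOT NULL,   -- searched
    formatted_text TEXT NOT NULL,   -- displayed
    PRIMARY KEY (book_id, chapter, verse)
);
""")

conn.execute("INSERT INTO books (id, name) VALUES (1, 'John')")
conn.executemany(
    "INSERT INTO verses VALUES (?, ?, ?, ?, ?)",
    [
        (1, 3, 16,
         "For God so loved the world, that he gave his only begotten Son, "
         "that whosoever believeth in him should not perish, but have everlasting life.",
         "**For God so loved the world, that he gave his only begotten Son, "
         "that whosoever believeth in him should not perish, but have everlasting life.**"),
        (1, 3, 17,
         "For God sent not his Son into the world to condemn the world; "
         "but that the world through him might be saved.",
         "**For God sent not his Son into the world to condemn the world; "
         "but that the world through him might be saved.**"),
    ],
)

def find(conn, phrase):
    """Match against plain_text only; return the formatted text for display."""
    cursor = conn.execute(
        "SELECT b.name, v.chapter, v.verse, v.formatted_text "
        "FROM verses v JOIN books b ON b.id = v.book_id "
        "WHERE v.plain_text LIKE ? "
        "ORDER BY v.book_id, v.chapter, v.verse",
        (f"%{phrase}%",),
    )
    return cursor.fetchall()

for name, chapter, verse, text in find(conn, "everlasting life"):
    print(f"{name} {chapter}:{verse} {text}")
```

Note that a per-row match like this cannot find a phrase that runs from one verse into the next, which is the multi-verse requirement from the question.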

PS: 5 MB of text is nothing really. If you have a dedicated program, you could keep it in memory in a single string and use strpos or a similar function to find a text. What language, database and platform are you using?
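Here is a minimal sketch of that in-memory approach, using Python's `str.find` in place of PHP's `strpos`. It also records where each verse starts in the big string, so a hit can be mapped back to verse references even when the phrase crosses a verse boundary. The function names and the two-verse sample are assumptions for illustration; it returns only the first hit and matches raw substrings, so it does not respect word boundaries.

```python
import bisect
import re

def normalize(text):
    """Lowercase, drop punctuation and collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return " ".join(text.split())

def build_corpus(verses):
    """verses: ordered list of (reference, text).

    Returns (corpus, starts, refs), where starts[i] is the offset in corpus
    at which the normalized text of refs[i] begins.
    """
    parts, starts, refs = [], [], []
    pos = 0
    for ref, text in verses:
        norm = normalize(text)
        starts.append(pos)
        refs.append(ref)
        parts.append(norm)
        pos += len(norm) + 1  # +1 for the space that joins consecutive verses
    return " ".join(parts), starts, refs

def find_verses(query, corpus, starts, refs):
    """Return the references covered by the first occurrence of the query."""
    q = normalize(query)
    if not q:
        return []
    hit = corpus.find(q)
    if hit == -1:
        return []
    first = bisect.bisect_right(starts, hit) - 1
    last = bisect.bisect_right(starts, hit + len(q) - 1) - 1
    return refs[first:last + 1]

# Hypothetical sample data; a real program would load the whole text once.
verses = [
    ("John 3:16", "For God so loved the world, that he gave his only begotten Son, "
                  "that whosoever believeth in him should not perish, "
                  "but have everlasting life."),
    ("John 3:17", "For God sent not his Son into the world to condemn the world; "
                  "but that the world through him might be saved."),
]
corpus, starts, refs = build_corpus(verses)
print(find_verses("but have everlasting life for God sent not his Son",
                  corpus, starts, refs))
```

With the query from the question, the match starts in John 3:16 and ends in John 3:17, so both references come back.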
