How can I programmatically search for words in a book?
I need to develop an application that can search through a book and list all the pages and lines that contain a given keyword.
For books that are divided some other way, such as the Bible, which is split into chapters and verses, users should be able to search for all verses that contain a certain keyword, or alternatively search for a keyword only within certain chapters and verses.
What format should I store the book in? Should it be stored in a SQL database?
Which format would be easiest to search, as opposed to easiest to store?
It kind of depends on the environment you want to run it in, and how many queries you expect per second.
The fastest approach is to keep an in-memory hash table keyed by word, where each value holds references to the chapters/verses (or whatever units you want to retrieve) that contain that word.
But this may not scale well if the book is very large, or the client is very thin.
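To make the hash table idea concrete, here is a minimal sketch of such an in-memory index. The sample verses, the `(chapter, verse)` reference format, and the function names are my own assumptions for illustration, not something from a particular library.

```python
import re
from collections import defaultdict

# Hypothetical sample data: (chapter, verse) -> abbreviated text.
VERSES = {
    (1, 1): "In the beginning God created the heaven and the earth.",
    (1, 2): "And the earth was without form, and void.",
    (1, 3): "And God said, Let there be light: and there was light.",
}


def build_index(verses):
    """Map each lowercased word to the sorted references that contain it."""
    index = defaultdict(set)
    for ref, text in verses.items():
        for word in re.findall(r"[a-z']+", text.lower()):
            index[word].add(ref)
    return {word: sorted(refs) for word, refs in index.items()}


def lookup(index, keyword):
    """Return every reference whose text contains the whole word `keyword`."""
    return index.get(keyword.lower(), [])


INDEX = build_index(VERSES)
print(lookup(INDEX, "earth"))
```

Building the index costs one pass over the book; after that each lookup is a single dictionary access, which is why this is fast but memory-hungry.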
You could store every verse as a database record and search with full-text search. But if you need to host the app on a website, make sure the hosting costs of your chosen database do not exceed your budget.
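As a rough sketch of the one-record-per-verse layout, the snippet below uses Python's built-in `sqlite3` module. The table name, column names, and sample rows are assumptions of mine, and the plain substring filter (`instr`) only stands in for whatever full-text index your database actually provides.

```python
import sqlite3

# Hypothetical sample rows: (book, chapter, verse, abbreviated text).
ROWS = [
    ("Genesis", 1, 1, "In the beginning God created the heaven and the earth."),
    ("Genesis", 1, 3, "And God said, Let there be light: and there was light."),
    ("Genesis", 2, 7, "And the LORD God formed man of the dust of the ground."),
    ("John", 1, 5, "And the light shineth in darkness."),
]


def build_db(rows):
    """Create an in-memory database holding one row per verse."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE verse ("
        " book TEXT, chapter INTEGER, verse INTEGER, body TEXT,"
        " PRIMARY KEY (book, chapter, verse))"
    )
    conn.executemany("INSERT INTO verse VALUES (?, ?, ?, ?)", rows)
    return conn


def search(conn, keyword, book=None, chapters=None):
    """Find verses containing `keyword`, optionally limited to a book
    and an inclusive (first, last) chapter range."""
    sql = ("SELECT book, chapter, verse FROM verse "
           "WHERE instr(lower(body), lower(?)) > 0")
    params = [keyword]
    if book is not None:
        sql += " AND book = ?"
        params.append(book)
    if chapters is not None:
        sql += " AND chapter BETWEEN ? AND ?"
        params.extend(chapters)
    sql += " ORDER BY book, chapter, verse"
    return conn.execute(sql, params).fetchall()


CONN = build_db(ROWS)
print(search(CONN, "light"))
print(search(CONN, "god", book="Genesis", chapters=(2, 2)))
```

Keeping book, chapter, and verse as separate columns is what makes the "search only within certain chapters" case a simple extra `WHERE` condition.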
If your application load allows it, you can also store every verse in a text file (plain text, XML, or any other format) and scan each file, preferably with XPath or regular expressions. It's a very cheap and easy solution that you can make as advanced as you like, but it is probably slower. Then again, if you only need to serve one request per hour, why not?
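For the plain-text route, a regular-expression scan might look like the sketch below. It assumes (my assumption, not a given) that pages are separated by form-feed characters, and it works on an in-memory string so that it is self-contained; reading the text from a file first would be a one-line change.

```python
import re

# Hypothetical sample book: two pages separated by a form feed ("\f").
BOOK = (
    "Call me Ishmael.\n"
    "Some years ago, never mind how long.\f"
    "The sea was calm that season.\n"
    "Ishmael watched the sea."
)


def find_keyword(book_text, keyword):
    """Return (page, line, text) for every line containing the whole word."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    hits = []
    for page_no, page in enumerate(book_text.split("\f"), start=1):
        for line_no, line in enumerate(page.splitlines(), start=1):
            if pattern.search(line):
                hits.append((page_no, line_no, line.strip()))
    return hits


for hit in find_keyword(BOOK, "sea"):
    print(hit)
```

The `\b` anchors restrict matches to whole words, so "sea" does not match "season"; drop them if partial matches are wanted. Every query rescans the whole text, which is the "probably slower" part.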
I would use the database with full-text-search, since that scales the best.
Years ago there was a Bible already stored in an Access database, which I used to build an application exactly like the one you're describing. The Access DB was a free download. A few years back, I also ran across one in XML. I can't look it up from work, but I would recommend searching for "Access Bible" or "XML Bible" to see if you can find it. (I think the original Access one may have been called ASP Bible.) At any rate, if you can find it, it should give you a good idea of how to structure your database.
Is the program supposed to search any book or just a particular book? Books other than the Bible do not have content split up into chapter and verse like the Bible does. The answer will depend on what kind of format the book is in currently.
I would suggest using an off-the-shelf full text engine like Lucene.NET. You'll get all kinds of features you would not get if you did it yourself.
Do you expect multiple queries for the same book? That is, do you want to do per-book preprocessing that may take a lot of time but has to be done only once per book? If not, the Boyer-Moore algorithm is probably the best way to go.
Do you only want to search for complete words, or also for beginnings of words? For complete words, a simple hash table is probably fastest. If you want to look for parts of words, I'd suggest a suffix tree.
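A real suffix tree is a fair amount of code, so as an illustration of the same idea here is a much simpler stand-in: a sorted list of every suffix of every word, searched with binary search. This is not a suffix tree, and the function names are made up for the example.

```python
import bisect


def build_suffix_list(words):
    """Return a sorted list of (suffix, word) pairs for every unique word."""
    entries = set()
    for word in words:
        word = word.lower()
        for start in range(len(word)):
            entries.add((word[start:], word))
    return sorted(entries)


def words_containing(suffixes, fragment):
    """Return the words that contain `fragment` anywhere inside them.

    A word contains the fragment exactly when one of its suffixes starts
    with it, and all such suffixes sit next to each other in sorted order.
    """
    fragment = fragment.lower()
    start = bisect.bisect_left(suffixes, (fragment,))
    found = set()
    for suffix, word in suffixes[start:]:
        if not suffix.startswith(fragment):
            break
        found.add(word)
    return sorted(found)


SUFFIXES = build_suffix_list(["light", "night", "lightning", "earth"])
print(words_containing(SUFFIXES, "ight"))
```

The words found this way can then be looked up in a word-to-location hash table, so partial-word search costs only one extra step on top of whole-word search.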
When you know what algorithm you're using, deciding the best data structure (database, flat file, etc.) should be an easier choice.
You could look into the Boyer-Moore algorithm and its original paper.
Unfortunately, Boyer-Moore gains most of its speed on longer search patterns and helps much less with short "keyword" searches. So for keyword searching you might want to implement some sort of crawler that indexes likely search terms ahead of time.
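To show where the pattern-length dependence comes from, here is a sketch of the Horspool simplification of Boyer-Moore (the full algorithm adds a second "good suffix" rule that is left out here). The function name is my own.

```python
def horspool_find_all(text, pattern):
    """Return the start offset of every occurrence of `pattern` in `text`."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # For each pattern character except the last: how far the window can
    # safely jump when that character sits under the window's last position.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    hits = []
    pos = 0
    while pos <= n - m:
        # A slice comparison stands in for the usual right-to-left loop.
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        # Characters that do not occur in the pattern allow a full jump of m.
        pos += shift.get(text[pos + m - 1], m)
    return hits


print(horspool_find_all("the earth and the heaven", "the"))
```

The largest possible jump is the pattern length `m`, so a three-letter keyword can skip at most three characters at a time, while a long phrase can skip much further.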
Another troubling consideration is that in most books a chapter occupies its own run of pages, whereas in a Bible a chapter or verse can be split across multiple pages, and a single page can contain multiple verses and chapters.
This means that if you split up your text by verse, then any search phrases that cross verse boundaries will come up with no results (or incorrect ones).
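One way around the verse-boundary problem, sketched here under my own assumptions about the data layout, is to search the concatenated text and keep a list of the offsets where each verse starts, so that a match can be mapped back to every verse it touches. The sample text is invented.

```python
import bisect

# Hypothetical sample: a sentence that runs across two verses.
VERSES = [
    ((1, 1), "The river ran down to the"),
    ((1, 2), "sea and the sea was calm."),
]


def build_corpus(verses):
    """Join verse texts with single spaces and record each verse's offset."""
    refs, starts, parts = [], [], []
    offset = 0
    for ref, text in verses:
        refs.append(ref)
        starts.append(offset)
        parts.append(text)
        offset += len(text) + 1  # +1 for the joining space
    return " ".join(parts), starts, refs


def find_phrase(corpus, starts, refs, phrase):
    """Return, for each match, the list of verse references it spans.

    Lowercasing is assumed not to change string lengths (true for ASCII).
    """
    haystack, needle = corpus.lower(), phrase.lower()
    if not needle:
        return []
    results = []
    pos = haystack.find(needle)
    while pos != -1:
        first = bisect.bisect_right(starts, pos) - 1
        last = bisect.bisect_right(starts, pos + len(needle) - 1) - 1
        results.append(refs[first:last + 1])
        pos = haystack.find(needle, pos + 1)
    return results


CORPUS, STARTS, REFS = build_corpus(VERSES)
print(find_phrase(CORPUS, STARTS, REFS, "to the sea"))
```

Here "to the sea" starts in verse 1:1 and ends in 1:2, and the result lists both references instead of reporting nothing.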
A further consideration is the proximity search, such as whether or not you require exact search phrases, or just groups of keywords.
I think the first and most important task is to nail down and firm up your requirements. Then you should figure out what format you will be receiving the books in. Once you know your constraints, you can begin to make your architectural design decisions.
Substitute a block of text for each line to fit your specific Bible example. How you store the text is really irrelevant. All you're doing is searching some given text (most likely in a loop) for a keyword.
If you want to search by line number and other arbitrary fields, you're best off storing the information in a database with the relevant fields and running the search on whichever fields are relevant.
FYI - the code above is Python.