Searching and indexing Arabic text files
I am working on an electronic library project (for Arabic books): a program that lets the user import books into the system's library and search that library. The system ships with a basic library (a set of books) that the user can update later.
To handle searching, I planned for the system to have an initial table in the DB holding the basic search keywords. Every search keyword points to its locations in the books in the library.
The problem appears when the user imports a new book into the library. There are two steps.
The first is to search the new book for the keywords that are already in the system, to find out whether any of them appear in the book, and to add their locations to the system.
The second, which is the main stumbling block, is to identify NEW search keywords in the new book.
The idea I have, which I think is pretty bad and naive, is to break the new book into tokens and then search for each token in all the books already in the library.
So, to sum up, I would appreciate any help (tools, libraries or DB options), any idea for solving the second problem, or a different idea for the whole system. I have tried reading and searching a lot for a solution, but in vain.
Thanks a lot,
Comments (2)
You want Lucene.net. You will need to use the Arabic Analyzer.
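To give a feel for why an Arabic-aware analyzer matters, here is a rough Python sketch of the kind of normalization such an analyzer typically applies before indexing: stripping diacritics and tatweel, and folding letter variants together. This is not Lucene.net's code; the function names `normalize_arabic` and `tokenize` are made up for illustration, and the exact set of rules is an assumption.

```python
import re

# Assumed rule set, for illustration only (not Lucene.net's implementation).
# Short-vowel and related marks U+064B..U+0652, plus superscript alef U+0670.
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
_TATWEEL = "\u0640"  # the stretching character used for justification
# Alef with madda, hamza above, hamza below -> bare alef.
_ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")
_ARABIC_WORD = re.compile("[\u0621-\u064A]+")


def normalize_arabic(text: str) -> str:
    """Fold spelling variants so different writings of a word compare equal."""
    text = _DIACRITICS.sub("", text)
    text = text.replace(_TATWEEL, "")
    text = _ALEF_VARIANTS.sub("\u0627", text)
    text = text.replace("\u0649", "\u064A")  # alef maksura -> yeh
    text = text.replace("\u0629", "\u0647")  # teh marbuta -> heh
    return text


def tokenize(text: str) -> list[str]:
    """Normalize, then split into runs of Arabic letters."""
    return _ARABIC_WORD.findall(normalize_arabic(text))


# A fully vocalised word and its bare form normalize to the same string.
vocalised = "\u0643\u0650\u062A\u064E\u0627\u0628\u064C"
bare = "\u0643\u062A\u0627\u0628"
print(normalize_arabic(vocalised) == bare)
```

Without a step like this, a search for the bare form of a word would miss the same word written with vowel marks or with a different alef form.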
http://www.ibm.com/developerworks/java/library/os-apache-lucenesearch/index.html
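For intuition about what a full-text engine does with the two import steps from the question, here is a minimal Python sketch of an inverted index. It is only an illustration, not how Lucene.net is implemented: the class `InvertedIndex` and its methods are hypothetical names, and the text is assumed to be already free of diacritics so that a plain `\w+` split yields whole words. The point is that indexing a new book touches only that book. Each token is looked up in the term map, so known keywords gain new locations and unseen tokens become new keywords in the same pass, with no need to rescan the books already in the library.

```python
import re
from collections import defaultdict


class InvertedIndex:
    """Hypothetical in-memory index: term -> {book_id -> [token positions]}."""

    def __init__(self):
        self.postings = defaultdict(lambda: defaultdict(list))

    def add_book(self, book_id: str, text: str) -> set[str]:
        """Index one book and return the terms that were new to the library."""
        new_terms = set()
        for position, term in enumerate(re.findall(r"\w+", text)):
            if term not in self.postings:
                new_terms.add(term)  # keyword first seen in this book
            self.postings[term][book_id].append(position)
        return new_terms

    def search(self, term: str) -> dict[str, list[int]]:
        """Return {book_id: positions} for a single term."""
        return {book: list(pos) for book, pos in self.postings.get(term, {}).items()}


index = InvertedIndex()
index.add_book("book-1", "\u0643\u062A\u0627\u0628 \u062C\u062F\u064A\u062F")
added = index.add_book("book-2", "\u0643\u062A\u0627\u0628 \u0642\u062F\u064A\u0645 \u0643\u062A\u0627\u0628")
print(added)                                      # terms first seen in book-2
print(index.search("\u0643\u062A\u0627\u0628"))  # locations in both books
```

A real engine adds on-disk storage, ranking and phrase queries on top of this structure, which is the reason to reach for a library rather than a hand-built keyword table.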