Search relevance for XML documents (XQuery?) vs. MySQL
I have a website whose documents are stored as XML files, all with the same structure.
I need a search engine that returns the documents most relevant to the keywords a user enters.
I thought it might be a good idea to build it with XQuery rather than storing the information twice (in the XML files and in a MySQL database) and querying MySQL for relevance searches.
Is XQuery any good for this, how would I go about it, and what speed can I expect on 1000+ documents of about 7 KB each?
Thank you for your time.
Kind regards
Comments (2)
If you are searching 1000+ documents for each query, using XQuery or an SQL database is not efficient.
1) A sequential search through each document for every keyword takes no fewer than (# of documents) * (# of words in each document) * (# of keywords) comparisons.
2) Every time you run a search, every document has to be scanned again. If your project involves many searches, this is not feasible.
3) A sequential search gives you no way to rank results by how many of the words were found, the total number of words in a document, the importance of each word, and so on.
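To make the cost in point 1 concrete, here is a minimal Python sketch of a naive sequential scan that counts its own comparisons. The function name `sequential_search` and the two sample documents are made up for illustration. Plain strings stand in for the text of the XML files.

```python
def sequential_search(docs, keywords):
    """Scan every word of every document once per keyword.

    Returns (hits, comparisons): hits maps doc id -> number of keyword
    matches, comparisons counts the word-to-keyword checks performed.
    """
    comparisons = 0
    hits = {}
    for doc_id, text in docs.items():
        words = text.lower().split()
        for keyword in keywords:
            for word in words:
                comparisons += 1
                if word == keyword:
                    hits[doc_id] = hits.get(doc_id, 0) + 1
    return hits, comparisons


# Hypothetical sample data: 4 words and 5 words.
docs = {
    "a.xml": "xquery searches xml documents",
    "b.xml": "mysql stores rows in tables",
}
hits, comparisons = sequential_search(docs, ["xml", "mysql"])
print(hits, comparisons)
```

With 9 words in total and 2 keywords, the scan performs 9 * 2 = 18 comparisons, and it repeats all of them on the next query.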
A better alternative is to use an inverted index data structure to index your documents and words ahead of time.
This way you do some work up front to index each word in each document, but you save a lot of time when doing the actual searching (which is what matters).
Another advantage is that you can rank documents in a principled rather than ad hoc way. See the vector space model.
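As a rough illustration of the idea, the following Python sketch builds an inverted index over a few XML strings and ranks matches with a simplified TF-IDF weight in the spirit of the vector space model. It does not do full cosine normalization. The sample documents, their element names, and the helper names (`tokenize`, `xml_text`, `build_index`, `search`) are all assumptions made for this example. A real site would read its own files and would probably want stemming and stop words.

```python
import math
import re
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def xml_text(xml_string):
    """Concatenate all text nodes of an XML document."""
    return " ".join(ET.fromstring(xml_string).itertext())


def build_index(docs):
    """Build term -> {doc_id: term frequency} plus per-document lengths."""
    index = defaultdict(dict)
    lengths = {}
    for doc_id, xml_string in docs.items():
        counts = Counter(tokenize(xml_text(xml_string)))
        lengths[doc_id] = sum(counts.values())
        for term, tf in counts.items():
            index[term][doc_id] = tf
    return index, lengths


def search(query, index, lengths):
    """Score only documents that contain a query term, best first."""
    n_docs = len(lengths)
    scores = defaultdict(float)
    for term in set(tokenize(query)):
        postings = index.get(term, {})
        if not postings:
            continue
        # Rare terms weigh more than terms found in many documents.
        idf = math.log(1 + n_docs / len(postings))
        for doc_id, tf in postings.items():
            # Normalize by document length so long documents do not win by size.
            scores[doc_id] += (tf / lengths[doc_id]) * idf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical documents; the element names are illustrative only.
docs = {
    "a.xml": "<doc><title>XQuery search</title><body>XQuery queries XML documents</body></doc>",
    "b.xml": "<doc><title>MySQL search</title><body>MySQL stores rows in tables</body></doc>",
    "c.xml": "<doc><title>Cooking</title><body>Pasta recipes</body></doc>",
}
index, lengths = build_index(docs)
results = search("xquery search", index, lengths)
print(results)
```

The index is built once. Each query then touches only the posting lists of its own terms, not every word of every document.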
If you want a search solution for the XML documents (only searching, not complex document transactions), then I would suggest the Apache Lucene search engine.
The latest Apache Lucene 3.x version comes with decent search features.
On top of that you can use Apache Solr, which uses Lucene as its search engine and has all the administrative features, faceted browsing, and payloads.
(Note: Lucene implementations are also available for .NET, Java, Python, and Ruby.)
If you want a truly XQuery-based, open-source solution, then considering your document volume, try the eXist XML database. Load all your XML documents into the eXist database and then use XQuery. But this approach requires -