全文搜索 django: Mysql 还不错吗? (与狮身人面像、xapian)
我正在研究 django 的全文搜索引擎。 它必须安装简单、快速索引、快速索引更新、索引时不阻塞、快速搜索。
阅读了许多网页后,我列出了简短的列表: Mysql MYISAM 全文、djapian/python-xapian 和 django-sphinx 我没有选择 lucene,因为它看起来很复杂,也没有选择 haystack,因为它的功能比 djapian/django-spĥinx 少(比如字段加权)。
然后我做了一些基准测试,为此,我在网上收集了许多免费书籍,生成了一个包含 1 485 000 条记录(id、title、body)的数据库表,每条记录长约 600 字节。 我还从数据库中生成了 100 000 个现有单词的列表,并将它们打乱以创建搜索列表。为了进行测试,我在笔记本电脑(4Go RAM,双核 2.0Ghz)上运行了两次:第一次是在服务器重新启动后清除所有缓存,第二次是在服务器重新启动后不久进行,以测试缓存结果的好坏。以下是“自制”基准测试结果:
1485000 records with Title (150 bytes) and body (450 bytes)
Mysql 5.0.75/Ubuntu 9.04 Fulltext :
==========================================================================
Full indexing : 7m14.146s
1 thread, 1000 searchs with single word randomly taken from database :
First run : 0:01:11.553524
next run : 0:00:00.168508
Mysql 5.5.4 m3/Ubuntu 9.04 Fulltext :
==========================================================================
Full indexing : 6m08.154s
1 thread, 1000 searchs with single word randomly taken from database :
First run : 0:01:09.553524
next run : 0:00:20.316903
1 thread, 100000 searchs with single word randomly taken from database :
First run : 9m09s
next run : 5m38s
1 thread, 10000 random strings (random strings should not be found in database) :
just after the 100000 search test : 0:00:15.007353
1 thread, boolean search : 1000 x (+word1 +word2)
First run : 0:00:21.205404
next run : 0:00:00.145098
Djapian Fulltext :
==========================================================================
Full indexing : 84m7.601s
1 thread, 1000 searchs with single word randomly taken from database with prefetch :
First run : 0:02:28.085680
next run : 0:00:14.300236
python-xapian Fulltext :
==========================================================================
1 thread, 1000 searchs with single word randomly taken from database :
First run : 0:01:26.402084
next run : 0:00:00.695092
django-sphinx Fulltext :
==========================================================================
Full indexing : 1m25.957s
1 thread, 1000 searchs with single word randomly taken from database :
First run : 0:01:30.073001
next run : 0:00:05.203294
1 thread, 100000 searchs with single word randomly taken from database :
First run : 12m48s
next run : 9m45s
1 thread, 10000 random strings (random strings should not be found in database) :
just after the 100000 search test : 0:00:23.535319
1 thread, boolean search : 1000 x (word1 word2)
First run : 0:00:20.856486
next run : 0:00:03.005416
如您所见,Mysql 对于全文搜索一点也不差。此外,它的查询缓存非常高效。
在我看来,Mysql 是一个不错的选择,因为不需要安装任何东西(我只需要编写一个小脚本来将 Innodb 生产表同步到 MyISAM 搜索表),而且我并不真正需要像词干提取等高级搜索功能...
问题是:您对 Mysql 全文搜索引擎与 sphinx 和 xapian 有何看法?
I am studying fulltext search engines for django.
It must be simple to install, fast indexing, fast index update, not blocking while indexing, fast search.
After reading many web pages, I put in short list :
Mysql MYISAM fulltext, djapian/python-xapian, and django-sphinx
I did not choose lucene because it seems complex, nor haystack as it has less features than djapian/django-spĥinx (like fields weighting).
Then I made some benchmarks, to do so, I collected many free books on the net to generate a database table with 1 485 000 records (id,title,body), each record is about 600 bytes long.
From the database, I also generated a list of 100 000 existing words and shuffled them to create a search list. For the tests, I made 2 runs on my laptop (4Go RAM, Dual core 2.0Ghz): the first one, just after a server reboot to clear all caches, the second is done juste after in order to test how good are cached results. Here are the "home made" benchmark results :
1485000 records with Title (150 bytes) and body (450 bytes)
Mysql 5.0.75/Ubuntu 9.04 Fulltext :
==========================================================================
Full indexing : 7m14.146s
1 thread, 1000 searchs with single word randomly taken from database :
First run : 0:01:11.553524
next run : 0:00:00.168508
Mysql 5.5.4 m3/Ubuntu 9.04 Fulltext :
==========================================================================
Full indexing : 6m08.154s
1 thread, 1000 searchs with single word randomly taken from database :
First run : 0:01:09.553524
next run : 0:00:20.316903
1 thread, 100000 searchs with single word randomly taken from database :
First run : 9m09s
next run : 5m38s
1 thread, 10000 random strings (random strings should not be found in database) :
just after the 100000 search test : 0:00:15.007353
1 thread, boolean search : 1000 x (+word1 +word2)
First run : 0:00:21.205404
next run : 0:00:00.145098
Djapian Fulltext :
==========================================================================
Full indexing : 84m7.601s
1 thread, 1000 searchs with single word randomly taken from database with prefetch :
First run : 0:02:28.085680
next run : 0:00:14.300236
python-xapian Fulltext :
==========================================================================
1 thread, 1000 searchs with single word randomly taken from database :
First run : 0:01:26.402084
next run : 0:00:00.695092
django-sphinx Fulltext :
==========================================================================
Full indexing : 1m25.957s
1 thread, 1000 searchs with single word randomly taken from database :
First run : 0:01:30.073001
next run : 0:00:05.203294
1 thread, 100000 searchs with single word randomly taken from database :
First run : 12m48s
next run : 9m45s
1 thread, 10000 random strings (random strings should not be found in database) :
just after the 100000 search test : 0:00:23.535319
1 thread, boolean search : 1000 x (word1 word2)
First run : 0:00:20.856486
next run : 0:00:03.005416
As you can see, Mysql is not so bad at all for fulltext search. In addition, its query cache is very efficient.
Mysql seems to me a good choice as there is nothing to install (I need just to write a small script to synchronize an Innodb production table to a MyISAM search table) and as I do not really need advanced search feature like stemming etc...
Here is the question : What do you think about Mysql fulltext search engine vs sphinx and xapian ?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
我还没有测试过 Xapian,但我去年做了一个比较全文解决方案的演示:
http://www.slideshare.net/billkarwin/ practical-full-text-search-with-my-sql
Sphinx 的搜索速度是最快的。但对增量传入的数据建立索引很困难,因为向索引添加数据与从头开始创建整个索引的成本差不多。
因此,有些人维护两个 Sphinx 索引:一个包含存档数据的大索引,一个包含最新数据的小索引。他们定期(例如每周)将最近的索引合并到存档索引中(合并两个索引的成本较低),并截断小索引以为新的一周做准备。这对于论坛之类的东西非常有用,但对于 wiki 就不那么好了。
您还可以查看 Apache Solr。这是 Lucene 的包装器,它使 Lucene 的使用变得更加容易,而且功能更加丰富。当我设计那个演示文稿时,我并不知道 Solr。
《华盛顿时报》是一个将 Solr 与 Django 结合使用的项目示例:
I haven't tested Xapian but I did a presentation last year comparing fulltext solutions:
http://www.slideshare.net/billkarwin/practical-full-text-search-with-my-sql
Sphinx is the fastest at searches. But it's hard to index data that comes in incrementally, because adding data to an index is about as expensive as creating the whole index from scratch.
So some people maintain two Sphinx indexes: one large index with archived data, and one small index with recent data. Periodically (e.g. weekly) they merge the recent index into the archived index (merging two indexes is less expensive), and truncate the small index to prepare for the new week. This works great for something like a forum, but not as good for a wiki.
You might also check out Apache Solr. This is a wrapper for Lucene, and it makes using Lucene a lot easier and yet more featureful. I didn't know about Solr when I designed that presentation.
The Washington Times is an example of a project that uses Solr together with Django:
如果您可以使用 MyISAM 的全文那就太好了。将其内置到数据库中当然很方便,因此您可以通过与其他表的联接轻松且相对有效地执行搜索。布尔模式搜索很棒。
缺点是它在匹配单词方面相当初级。显然没有词干,但也没有对连字符/撇号进行特殊处理,以及默认的最小字长和 停止列表 太过分了。 (当软件认为“howbeit”是一个常用的单词时,请担心!)
最糟糕的是:当然它是令人讨厌的旧 MyISAM 所独有的,因此它不会很好地放置在您的 InnoDB 表中。 (您正在使用InnoDB,对吧?)
If you can get by with MyISAM's fulltext then great. It's certainly handy having it built-in to the database so you can easily and relatively-efficiently perform searches with joins to other tables. And boolean mode search is great.
The downside is it's pretty rudimentary at matching words. Obviously no stemming, but also no special handling of hyphen/apostrophe, and the default min word length and stop list is brutally excessive. (When software thinks “howbeit” is a commonly-typed word, worry!)
Worst: of course it's exclusive to nasty old MyISAM, so it's not going to sit nicely in your InnoDB tables. (You are using InnoDB, right?)
您还可以考虑 SphinxQL,它将 MySQL 全文功能的易用性与Sphinx 的强大功能和灵活性。
安装说明此处
You also might consider SphinxQL which sort of combines the ease of using MySQL's full text capabilities with the power and flexibility of Sphinx.
Installation instructions here