全文搜索 django: Mysql 还不错吗？（与狮身人面像、xapian）

发布于 2024-08-31 10:31:34 字数 2723 浏览 12 评论 0原文

我正在研究 django 的全文搜索引擎。它必须安装简单、快速索引、快速索引更新、索引时不阻塞、快速搜索。

阅读了许多网页后，我列出了简短的列表： Mysql MYISAM 全文、djapian/python-xapian 和 django-sphinx 我没有选择 lucene，因为它看起来很复杂，也没有选择 haystack，因为它的功能比 djapian/django-spĥinx 少（比如字段加权）。

然后我做了一些基准测试，为此，我在网上收集了许多免费书籍，生成了一个包含 1 485 000 条记录（id、title、body）的数据库表，每条记录长约 600 字节。我还从数据库中生成了 100 000 个现有单词的列表，并将它们打乱以创建搜索列表。为了进行测试，我在笔记本电脑（4Go RAM，双核 2.0Ghz）上运行了两次：第一次是在服务器重新启动后清除所有缓存，第二次是在服务器重新启动后不久进行，以测试缓存结果的好坏。以下是“自制”基准测试结果：

1485000 records with Title (150 bytes) and body (450 bytes)

Mysql 5.0.75/Ubuntu 9.04 Fulltext :
==========================================================================

Full indexing : 7m14.146s

1 thread, 1000 searchs with single word randomly taken from database : 
First run : 0:01:11.553524
next run : 0:00:00.168508

Mysql 5.5.4 m3/Ubuntu 9.04 Fulltext :
==========================================================================

Full indexing : 6m08.154s

1 thread, 1000 searchs with single word randomly taken from database : 
First run : 0:01:09.553524
next run : 0:00:20.316903

1 thread, 100000 searchs with single word randomly taken from database : 
First run : 9m09s
next run : 5m38s

1 thread, 10000 random strings (random strings should not be found in database) :
just after the 100000 search test : 0:00:15.007353

1 thread, boolean search : 1000 x (+word1 +word2) 
First run : 0:00:21.205404
next run : 0:00:00.145098

Djapian Fulltext : 
==========================================================================

Full indexing : 84m7.601s

1 thread, 1000 searchs with single word randomly taken from database with prefetch : 
First run : 0:02:28.085680
next run : 0:00:14.300236

python-xapian Fulltext :
==========================================================================

1 thread, 1000 searchs with single word randomly taken from database : 
First run : 0:01:26.402084
next run : 0:00:00.695092

django-sphinx Fulltext :
==========================================================================

Full indexing : 1m25.957s

1 thread, 1000 searchs with single word randomly taken from database : 
First run : 0:01:30.073001
next run : 0:00:05.203294

1 thread, 100000 searchs with single word randomly taken from database : 
First run : 12m48s
next run : 9m45s

1 thread, 10000 random strings (random strings should not be found in database) :
just after the 100000 search test : 0:00:23.535319

1 thread, boolean search : 1000 x (word1 word2) 
First run : 0:00:20.856486
next run : 0:00:03.005416

如您所见，Mysql 对于全文搜索一点也不差。此外，它的查询缓存非常高效。

在我看来，Mysql 是一个不错的选择，因为不需要安装任何东西（我只需要编写一个小脚本来将 Innodb 生产表同步到 MyISAM 搜索表），而且我并不真正需要像词干提取等高级搜索功能...

问题是：您对 Mysql 全文搜索引擎与 sphinx 和 xapian 有何看法？

原文

I am studying fulltext search engines for django.
It must be simple to install, fast indexing, fast index update, not blocking while indexing, fast search.

After reading many web pages, I put in short list :
Mysql MYISAM fulltext, djapian/python-xapian, and django-sphinx
I did not choose lucene because it seems complex, nor haystack as it has less features than djapian/django-spĥinx (like fields weighting).

Then I made some benchmarks, to do so, I collected many free books on the net to generate a database table with 1 485 000 records (id,title,body), each record is about 600 bytes long.
From the database, I also generated a list of 100 000 existing words and shuffled them to create a search list. For the tests, I made 2 runs on my laptop (4Go RAM, Dual core 2.0Ghz): the first one, just after a server reboot to clear all caches, the second is done juste after in order to test how good are cached results. Here are the "home made" benchmark results :

1485000 records with Title (150 bytes) and body (450 bytes)

Mysql 5.0.75/Ubuntu 9.04 Fulltext :
==========================================================================

Full indexing : 7m14.146s

1 thread, 1000 searchs with single word randomly taken from database : 
First run : 0:01:11.553524
next run : 0:00:00.168508

Mysql 5.5.4 m3/Ubuntu 9.04 Fulltext :
==========================================================================

Full indexing : 6m08.154s

1 thread, 1000 searchs with single word randomly taken from database : 
First run : 0:01:09.553524
next run : 0:00:20.316903

1 thread, 100000 searchs with single word randomly taken from database : 
First run : 9m09s
next run : 5m38s

1 thread, 10000 random strings (random strings should not be found in database) :
just after the 100000 search test : 0:00:15.007353

1 thread, boolean search : 1000 x (+word1 +word2) 
First run : 0:00:21.205404
next run : 0:00:00.145098

Djapian Fulltext : 
==========================================================================

Full indexing : 84m7.601s

1 thread, 1000 searchs with single word randomly taken from database with prefetch : 
First run : 0:02:28.085680
next run : 0:00:14.300236

python-xapian Fulltext :
==========================================================================

1 thread, 1000 searchs with single word randomly taken from database : 
First run : 0:01:26.402084
next run : 0:00:00.695092

django-sphinx Fulltext :
==========================================================================

Full indexing : 1m25.957s

1 thread, 1000 searchs with single word randomly taken from database : 
First run : 0:01:30.073001
next run : 0:00:05.203294

1 thread, 100000 searchs with single word randomly taken from database : 
First run : 12m48s
next run : 9m45s

1 thread, 10000 random strings (random strings should not be found in database) :
just after the 100000 search test : 0:00:23.535319

1 thread, boolean search : 1000 x (word1 word2) 
First run : 0:00:20.856486
next run : 0:00:03.005416

As you can see, Mysql is not so bad at all for fulltext search. In addition, its query cache is very efficient.

Mysql seems to me a good choice as there is nothing to install (I need just to write a small script to synchronize an Innodb production table to a MyISAM search table) and as I do not really need advanced search feature like stemming etc...

Here is the question : What do you think about Mysql fulltext search engine vs sphinx and xapian ?

分享到QQ

分享到微博