Full text search in Postgres or CouchDB?
I took geonames.org and imported all their data of German cities with all districts.
If I enter "Hamburg", it lists "Hamburg Center, Hamburg Airport" and so on. The application is in a closed network with no access to the internet, so I can't access the geonames.org web services and have to import the data. :(
The city with all of its districts works as an autocomplete, so each keystroke results in an XHR request and so on.
Now my customer has asked whether it is possible to have all data of the world in it. In the end, that's about 5,000,000 rows with 45,000,000 alternative names etc.
Postgres needs about 3 seconds per query, which makes the autocomplete unusable.
Now I'm thinking of CouchDB, which I have already worked with. My question:
I would like to post "Ham" and have CouchDB return all documents starting with "Ham". If I enter "Hamburg", I want it to return Hamburg and so forth.
Is CouchDB the right database for this? Which other DBs can you recommend that respond with low latency (maybe in-memory) over millions of records? The dataset doesn't change regularly; it's rather static!
If I understand your problem right, probably all you need is already built into CouchDB: query a view with a key range such as
`startkey="Ham"&endkey="Ham\ufff0"`
where `\ufff0` is a very high code point that serves as the upper bound for everything starting with "Ham".
Here is a view function to make this work:
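A minimal sketch, assuming each document stores the place name in a `name` field (design document and view names below are placeholders):

```javascript
// Map function: emit the name as the key, so key-range queries
// select all documents whose name starts with a given prefix.
function (doc) {
  if (doc.name) {
    emit(doc.name, null);
  }
}
```

You would then query it along the lines of `GET /places/_design/app/_view/by_name?startkey="Ham"&endkey="Ham\ufff0"` (URL-encoded).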
Also see the CouchOne blog post about CouchDB typeahead and autocomplete search and this discussion on the mailing list about CouchDB autocomplete.
Optimized search with PostgreSQL
Your search is anchored at the start and no fuzzy search logic is required. This is not the typical use case for full text search.
If it gets more fuzzy, or your search is not anchored at the start, different techniques apply (full text search or trigram indexes).
In PostgreSQL you can make use of advanced index features that should make the query very fast. In particular, look at operator classes and indexes on expressions.
1. `text_pattern_ops`
Assuming your column is of type `text`, you would use a special index for text pattern operators like this:
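A sketch, with `tbl` and `name` as assumed table and column names:

```sql
-- Special operator class for left-anchored pattern matching:
CREATE INDEX name_text_pattern_ops_idx ON tbl (name text_pattern_ops);

-- A left-anchored LIKE can now use the index:
SELECT name FROM tbl WHERE name LIKE 'Hamburg%';
```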
This is assuming that you operate with a database locale other than `C` - most likely `de_DE.UTF-8` in your case. With the `C` locale, an index with the default operator class is already usable for left-anchored pattern matching, so you could also set up the database that way and skip `text_pattern_ops`. Update: in modern Postgres, consider `COLLATE "C"` instead.
2. Index on expression
I'd imagine you would also want to make that search case insensitive. So let's take another step and make that an index on an expression:
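Sketched with the same assumed names:

```sql
-- Index on the expression lower(name):
CREATE INDEX lower_name_text_pattern_ops_idx
ON tbl (lower(name) text_pattern_ops);

-- Matching query:
SELECT name FROM tbl WHERE lower(name) LIKE lower('HamBurg') || '%';
```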
To make use of the index, the `WHERE` clause has to match the index expression.
3. Optimize index size and speed
Finally, you might also want to impose a limit on the number of leading characters to minimize the size of your index and speed things up even further:
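For instance (a sketch; the index name matches the one referenced in step 5):

```sql
-- Index only the first 10 lower-cased characters:
CREATE INDEX lower_left_name_text_pattern_ops_idx
ON tbl (lower(left(name, 10)) text_pattern_ops);
```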
`left()` was introduced with Postgres 9.1. Use `substring(name, 1, 10)` in older versions.
4. Cover all possible requests
What about strings with more than 10 characters?
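One way (a sketch): repeat the truncated expression so the index is used, and filter with the full pattern on top:

```sql
SELECT name FROM tbl
WHERE  lower(left(name, 10)) LIKE lower(left('Hamburg Airport', 10)) || '%' -- uses the index
AND    lower(name) LIKE lower('Hamburg Airport') || '%';                    -- filters the rest
```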
This looks redundant, but you need to spell it out this way to actually use the index. The index search narrows it down to a few entries; the additional clause filters the rest. Experiment to find the sweet spot; it depends on data distribution and typical use cases. 10 characters seem like a good starting point. For more than 10 characters, `left()` effectively turns into a very fast and simple hashing algorithm that's good enough for many (but not all) use cases. Update: in modern Postgres, consider the `^@` (starts-with) operator instead.
5. Optimize disk representation with `CLUSTER`
So, the predominant access pattern will be to retrieve a bunch of adjacent rows according to our index `lower_left_name_text_pattern_ops_idx`. And you mostly read and hardly ever write. This is a textbook case for `CLUSTER`, which physically reorders the table rows according to an index. With a huge table like yours, this can dramatically improve response time, because all rows to be fetched are in the same or adjacent blocks on disk.
First call:
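(With the assumed table name `tbl`:)

```sql
CLUSTER tbl USING lower_left_name_text_pattern_ops_idx;
```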
The information about which index to use is saved, so successive calls will re-cluster the table:
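```sql
-- Subsequent maintenance runs reuse the saved index:
CLUSTER tbl;
```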
If you don't want to repeat it:
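One way (assuming this is what's meant) is to clear the saved clustering index:

```sql
ALTER TABLE tbl SET WITHOUT CLUSTER;
```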
However, `CLUSTER` takes an exclusive lock on the table. If that's a problem, look into `pg_repack` or `pg_squeeze`, which can do the same without an exclusive lock on the table.
6. Prevent too many rows in the result
Demand a minimum of, say, 3 or 4 characters for the search string. I add this for completeness; you probably do it anyway.
And `LIMIT` the number of rows returned, as in the sketch below. If the query still finds more than 500 rows, tell the user to narrow down the search.
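A sketch, assuming a cap of 500; fetching one surplus row lets the application detect when to ask the user to narrow the search:

```sql
SELECT name FROM tbl
WHERE  lower(left(name, 10)) LIKE lower(left('Ham', 10)) || '%'
AND    lower(name) LIKE lower('Ham') || '%'
LIMIT  501;  -- show up to 500 rows; a 501st row means "narrow your search"
```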
7. Optimize filter method (operators)
If you absolutely must squeeze out every last microsecond, you can utilize the operators of the `text_pattern_ops` family directly, as in the sketch below. You gain very little with this last stunt; normally, the standard operators are the better choice.
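A sketch: for a prefix like `'ham'`, the equivalent range uses `'han'` (the prefix with its last character incremented) as the exclusive upper bound. This is essentially what the planner derives from a left-anchored `LIKE` anyway, which is why the gain is small:

```sql
SELECT name FROM tbl
WHERE  lower(left(name, 10)) ~>=~ 'ham'   -- text-pattern "greater than or equal"
AND    lower(left(name, 10)) ~<~  'han';  -- 'han' = 'ham' with the last character incremented
```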
If you do all that, search time will be reduced to a matter of milliseconds.
I think a better approach is to keep your data in your database (Postgres or CouchDB) and index it with a full-text search engine like Lucene, Solr, or ElasticSearch.
Having said that, there's a project integrating CouchDB with Lucene.