Full-text search in Postgres or CouchDB?

Posted 2024-10-21 19:06:17


I took geonames.org and imported all of their data for German cities with all their districts.

If I enter "Hamburg", it lists "Hamburg Center, Hamburg Airport" and so on. The application is in a closed network with no access to the internet, so I can't access the geonames.org web services and have to import the data. :(
The city with all of its districts works as an auto complete. So each key hit results in an XHR request and so on.

Now my customer has asked whether it is possible to have all of the world's data in it: in the end, about 5,000,000 rows with 45,000,000 alternative names etc.

Postgres needs about 3 seconds per query, which makes the autocomplete unusable.

Now I thought of CouchDB, which I have already worked with. My question:

I would like to post "Ham" and I want CouchDB to get all documents starting with "Ham". If I enter "Hamburg", I want it to return Hamburg and so forth.

Is CouchDB the right database for it? Which other DBs can you recommend that respond with low latency (maybe in-memory) and handle millions of records? The dataset doesn't change regularly; it's rather static!


3 Answers

爺獨霸怡葒院 2024-10-28 19:06:17


If I understand your problem right, probably all you need is already built into CouchDB.

  1. To get a range of documents with names beginning with e.g. "Ham", you can use a request with a string range: startkey="Ham"&endkey="Ham\ufff0"
  2. If you need a more comprehensive search, you can create a view containing the names of other places as keys. Then you can again query ranges using the technique above.

Here is a view function to do this:

function(doc) {
    // Assumes each document has a "places" object whose keys are
    // alternative place names. Emitting each name as the view key
    // enables the startkey/endkey prefix queries shown above.
    for (var name in doc.places) {
        emit(name, doc._id);
    }
}
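
For example, assuming the view above is saved in a design document _design/places under the view name by_name, and the database is called geo (all of these names are assumptions), a typeahead request could look like this (the \ufff0 has to be percent-encoded in a real URL):

GET /geo/_design/places/_view/by_name?startkey="Ham"&endkey="Ham\ufff0"&limit=20

The limit parameter keeps the response small enough for an autocomplete dropdown.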

Also see the CouchOne blog post about CouchDB typeahead and autocomplete search and this discussion on the mailing list about CouchDB autocomplete.

三生一梦 2024-10-28 19:06:17


Optimized search with PostgreSQL

Your search is anchored at the start and no fuzzy search logic is required. This is not the typical use case for full text search.

If it gets more fuzzy or your search is not anchored at the start, look into trigram indexes (the additional module pg_trgm) or full text search instead.

In PostgreSQL you can make use of advanced index features that should make the query very fast. In particular, look at operator classes and indexes on expressions.

1. text_pattern_ops

Assuming your column is of type text, you would use a special index for text pattern operators like this:

CREATE INDEX name_text_pattern_ops_idx
ON tbl (name text_pattern_ops);

SELECT name
FROM   tbl
WHERE  name ~~ ('Hambu' || '%');

This is assuming that you operate with a database locale other than C - most likely de_DE.UTF-8 in your case. You could also set up a database with locale 'C'. I quote the manual here:

If you do use the C locale, you do not need the xxx_pattern_ops operator classes, because an index with the default operator class is usable for pattern-matching queries in the C locale.

Update: In modern Postgres, consider COLLATE "C" instead.
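
A minimal sketch of that approach, assuming the same table and column as above (the index name is made up). With COLLATE "C", the default operator class supports prefix pattern matching, so text_pattern_ops is not needed:

CREATE INDEX tbl_name_c_idx
ON tbl (name COLLATE "C");

SELECT name
FROM   tbl
WHERE  name COLLATE "C" LIKE ('Hambu' || '%');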

2. Index on expression

I'd imagine you would also want to make that search case-insensitive. So let's take another step and make that an index on an expression:

CREATE INDEX lower_name_text_pattern_ops_idx
ON tbl (lower(name) text_pattern_ops);

SELECT name
FROM   tbl
WHERE  lower(name) ~~ (lower('Hambu') || '%');

To make use of the index, the WHERE clause has to match the index expression.
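
For example, the seemingly equivalent ILIKE variant does not match the indexed expression lower(name), so it cannot use the index (a hypothetical counter-example):

-- NOT supported by lower_name_text_pattern_ops_idx:
-- ILIKE on the bare column is not the indexed expression lower(name)
SELECT name
FROM   tbl
WHERE  name ILIKE 'Hambu%';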

3. Optimize index size and speed

Finally, you might also want to impose a limit on the number of leading characters to minimize the size of your index and speed things up even further:

CREATE INDEX lower_left_name_text_pattern_ops_idx
ON tbl (lower(left(name,10)) text_pattern_ops);

SELECT name
FROM   tbl
WHERE  lower(left(name,10)) ~~ (lower('Hambu') || '%');

left() was introduced with Postgres 9.1. Use substring(name, 1, 10) in older versions.

4. Cover all possible requests

What about strings with more than 10 characters?

SELECT name
FROM   tbl
WHERE  lower(left(name,10)) ~~ (lower(left('Hambu678910',10)) || '%')
AND    lower(name) ~~ (lower('Hambu678910') || '%');

This looks redundant, but you need to spell it out this way to actually use the index. Index search will narrow it down to a few entries, the additional clause filters the rest. Experiment to find the sweet spot. Depends on data distribution and typical use cases. 10 characters seem like a good starting point. For more than 10 characters, left() effectively turns into a very fast and simple hashing algorithm that's good enough for many (but not all) use cases.
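
To experiment with the prefix length, EXPLAIN ANALYZE shows whether the index is used and how many rows the additional clause has to filter (a sketch reusing the query from above):

EXPLAIN (ANALYZE, BUFFERS)
SELECT name
FROM   tbl
WHERE  lower(left(name,10)) ~~ (lower(left('Hambu678910',10)) || '%')
AND    lower(name) ~~ (lower('Hambu678910') || '%');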

Update: In modern Postgres, consider the ^@ (starts_with) operator instead.
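
A minimal sketch, assuming Postgres 11 or later (the index name is made up). ^@ is the starts_with operator and can be supported by an SP-GiST index:

CREATE INDEX tbl_lower_name_spgist_idx
ON tbl USING spgist (lower(name));

SELECT name
FROM   tbl
WHERE  lower(name) ^@ lower('Hambu');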

5. Optimize disk representation with CLUSTER

So, the predominant access pattern will be to retrieve a bunch of adjacent rows according to our index lower_left_name_text_pattern_ops_idx. And you mostly read and hardly ever write. This is a textbook case for CLUSTER. The manual:

When a table is clustered, it is physically reordered based on the index information.

With a huge table like yours, this can dramatically improve response time because all rows to be fetched are in the same or adjacent blocks on disk.

First call:

CLUSTER tbl USING lower_left_name_text_pattern_ops_idx;

The information about which index to use is saved, and successive calls will re-cluster the table:

CLUSTER tbl;
CLUSTER;    -- cluster all tables in the db that have previously been clustered.

If you don't want to repeat it:

ALTER TABLE tbl SET WITHOUT CLUSTER;

However, CLUSTER takes an exclusive lock on the table. If that's a problem, look into pg_repack or pg_squeeze, which can do the same without an exclusive lock on the table.

6. Prevent too many rows in the result

Demand a minimum of, say, 3 or 4 characters for the search string. I add this for completeness; you probably do it anyway.
And LIMIT the number of rows returned:

SELECT name
FROM   tbl
WHERE  lower(left(name,10)) ~~ (lower('Hambu') || '%')
LIMIT  501;

If your query returns more than 500 rows, tell the user to narrow down their search.

7. Optimize filter method (operators)

If you absolutely must squeeze out every last microsecond, you can utilize operators of the text_pattern_ops family. Like this:

-- chr(1114111) is U+10FFFF, the highest code point allowed in a UTF8 database
SELECT name
FROM   tbl
WHERE  lower(left(name, 10)) ~>=~ lower('Hambu')
AND    lower(left(name, 10)) ~<=~ (lower('Hambu') || chr(1114111));

You gain very little with this last stunt. Normally, standard operators are the better choice.


If you do all that, search time will be reduced to a matter of milliseconds.

油饼 2024-10-28 19:06:17


I think a better approach is to keep your data in your database (Postgres or CouchDB) and index it with a full-text search engine like Lucene, Solr, or ElasticSearch.

Having said that, there's a project integrating CouchDB with Lucene.
