Help choosing a NoSQL database for a project


There is a table:

doc_id(integer)-value(integer)

There are roughly 100,000 distinct doc_ids and 27,000,000 rows.
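
For reference, a minimal sketch of the implied schema; the table, column and index names are taken from the query and plan below, the exact types and NOT NULL constraints are assumptions, and the "value" column appears as doc_crc32 in the actual query:

CREATE TABLE testing.text_attachment (
    doc_id    integer NOT NULL,  -- document identifier (~100,000 distinct values)
    doc_crc32 integer NOT NULL   -- one value (a CRC32) per row, ~27,000,000 rows in total
);
CREATE INDEX crc32_idx ON testing.text_attachment (doc_crc32);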

Most queries against this table search for documents similar to the current one:

select 10 documents with maximum of
     (count of values common to the current document) / (count of values in the document).

We currently use PostgreSQL. The table weighs ~1.5 GB with its index. Average query time is ~0.5 s, which is too high, and in my opinion this time will grow exponentially as the database grows.

Should I move all of this to a NoSQL database, and if so, which one?

QUERY:

EXPLAIN ANALYZE
SELECT D.doc_id as doc_id,
  (count(D.doc_crc32) *1.0 / testing.get_count_by_doc_id(D.doc_id))::real as avg_doc 
FROM testing.text_attachment D
WHERE D.doc_id != 29758 -- 29758 is an arbitrary doc_id
  AND D.doc_crc32 IN (select testing.get_crc32_rows_by_doc_id(29758)) -- get_crc32... is IMMUTABLE
GROUP BY D.doc_id
ORDER BY avg_doc DESC
LIMIT 10

Limit  (cost=95.23..95.26 rows=10 width=8) (actual time=1849.601..1849.641 rows=10 loops=1)
   ->  Sort  (cost=95.23..95.28 rows=20 width=8) (actual time=1849.597..1849.609 rows=10 loops=1)
         Sort Key: (((((count(d.doc_crc32))::numeric * 1.0) / (testing.get_count_by_doc_id(d.doc_id))::numeric))::real)
         Sort Method:  top-N heapsort  Memory: 25kB
         ->  HashAggregate  (cost=89.30..94.80 rows=20 width=8) (actual time=1211.835..1847.578 rows=876 loops=1)
               ->  Nested Loop  (cost=0.27..89.20 rows=20 width=8) (actual time=7.826..928.234 rows=167771 loops=1)
                     ->  HashAggregate  (cost=0.27..0.28 rows=1 width=4) (actual time=7.789..11.141 rows=1863 loops=1)
                           ->  Result  (cost=0.00..0.26 rows=1 width=0) (actual time=0.130..4.502 rows=1869 loops=1)
                     ->  Index Scan using crc32_idx on text_attachment d  (cost=0.00..88.67 rows=20 width=8) (actual time=0.022..0.236 rows=90 loops=1863)
                           Index Cond: (d.doc_crc32 = (testing.get_crc32_rows_by_doc_id(29758)))
                           Filter: (d.doc_id <> 29758)
 Total runtime: 1849.753 ms
(12 rows)

Comments (4)

街角卖回忆 2024-09-01 13:23:44

1.5 GB is nothing. Serve it from RAM. Build a data structure that helps you search.

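If "serve from RAM" means keeping the PostgreSQL data itself hot, one concrete way to do that (a sketch, assuming PostgreSQL 9.4+ and a shared_buffers setting large enough to hold the data; not part of the original answer) is the pg_prewarm extension:

-- Load the table and its index into PostgreSQL's buffer cache.
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('testing.text_attachment');
SELECT pg_prewarm('testing.crc32_idx');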

黄昏下泛黄的笔记 2024-09-01 13:23:44

I don't think your main problem here is the kind of database you're using, but the fact that you don't actually have an "index" for what you're searching: the similarity between documents.

My proposal is to determine, once, which 10 documents are most similar to each of the 100,000 doc_ids and cache the result in a new table like this:

doc_id(integer)-similar_doc(integer)-score(integer)

where you insert 10 rows per document, each representing one of its 10 best matches. You end up with about 1,000,000 rows (10 per doc_id) that you can access directly through an index, which should bring the search down to something like O(log n), depending on the index implementation.

Then, on each insertion or removal of a document (or of one of its values), you iterate through the documents and update the new table accordingly. For example, when a new document is inserted, for each document already in the table:

  1. calculate its match score against the new document, and
  2. if that score is higher than the lowest score cached for it in the new table, swap in the similar_doc and score of the newly inserted document.

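A rough SQL sketch of this proposal (illustrative only: the similar_docs table name is made up, score is stored as real rather than integer because the question's metric is a ratio, and the population query simply reuses the functions from the question):

-- Hypothetical cache table holding the 10 best matches per document.
CREATE TABLE testing.similar_docs (
    doc_id      integer NOT NULL,
    similar_doc integer NOT NULL,
    score       real    NOT NULL,
    PRIMARY KEY (doc_id, similar_doc)
);

-- One-off population for a single document (29758, as in the question);
-- in practice you would loop over all doc_ids and refresh rows on writes.
INSERT INTO testing.similar_docs (doc_id, similar_doc, score)
SELECT 29758,
       D.doc_id,
       (count(D.doc_crc32) * 1.0 / testing.get_count_by_doc_id(D.doc_id))::real AS score
FROM testing.text_attachment D
WHERE D.doc_id <> 29758
  AND D.doc_crc32 IN (SELECT testing.get_crc32_rows_by_doc_id(29758))
GROUP BY D.doc_id
ORDER BY score DESC
LIMIT 10;

-- Lookups then become a cheap indexed read:
SELECT similar_doc, score
FROM testing.similar_docs
WHERE doc_id = 29758
ORDER BY score DESC;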

╭ゆ眷念 2024-09-01 13:23:44

If you're getting that kind of poor performance out of PostgreSQL, a good start would be to tune PostgreSQL, your query and possibly your data model. A query like that should run a lot faster on such a small table.

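As an illustration of where such tuning might start (the specific values below are placeholders, not recommendations from this answer):

-- Refresh planner statistics; the plan in the question estimates 20 rows
-- for the nested loop where 167,771 actually come back.
ANALYZE testing.text_attachment;

-- Settings commonly reviewed in postgresql.conf (illustrative values only):
--   shared_buffers       = '512MB'   -- keep the ~1.5 GB table largely cached
--   work_mem             = '64MB'    -- headroom for the hash aggregate and sort
--   effective_cache_size = '2GB'     -- how much OS cache the planner may assume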

浅唱ヾ落雨殇 2024-09-01 13:23:44

First, is 0.5 s actually a problem or not? And have you already optimized your queries, data model and configuration settings? If not, you can still get better performance. Performance is a choice.

Besides speed there is also functionality, and that is what you would lose.

===

What about pushing the function into a JOIN:

EXPLAIN ANALYZE
SELECT 
    D.doc_id as doc_id,
    (count(D.doc_crc32) *1.0 / testing.get_count_by_doc_id(D.doc_id))::real as avg_doc 
FROM 
    testing.text_attachment D
        JOIN (SELECT testing.get_crc32_rows_by_doc_id(29758) AS r) AS crc ON D.doc_crc32 = r
WHERE 
    D.doc_id <> 29758
GROUP BY D.doc_id
ORDER BY avg_doc DESC
LIMIT 10
