
Large join tables and scaling

Posted on 2024-12-01 05:30:57

The Problem

We have a rapidly growing database with several large join tables (currently in the billions of rows), but as these tables have grown the query time has suffered. The concern is that as more data is added to the tables linked by these join tables, the join tables will continue to grow at a faster pace and adversely impact query speed.

The Background

I am dealing with a database that is storing genomic information. A number of markers (~3 million) corresponding to loci where there are DNA variations are linked to individuals that have had their genotype determined at these loci. Every marker has a number of possible genotypes of which every individual must have one.

The Current Implementation

When the database (PostgreSQL) was still small, there were no problems in linking the genotypes to the markers using foreign keys and then linking the individuals to their genotypes through a join table. That way, it was easy to look up all of an individual's genotypes or to look up all the individuals having a specific genotype.

Slimmed-down versions of these tables are listed below:

                                        Table "public.genotypes"
      Column      |            Type             |                       Modifiers                        
------------------+-----------------------------+--------------------------------------------------------
 id               | integer                     | not null default nextval('genotypes_id_seq'::regclass)
 ref_variation_id | integer                     | 
 value            | character varying(255)      |  
Indexes:
    "genotypes_pkey" PRIMARY KEY, btree (id)
    "index_genotypes_on_ref_variation_id" btree (ref_variation_id)


Table "public.genotypes_individuals"
    Column     |  Type   | Modifiers 
---------------+---------+-----------
 genotype_id   | integer | 
 individual_id | integer | 
Indexes:
    "index_genotypes_individuals_on_genotype_id_and_individual_id" UNIQUE, btree (genotype_id, individual_id)
    "index_genotypes_individuals_on_genotype_id" btree (genotype_id)

                                       Table "public.individuals"
    Column     |            Type             |                        Modifiers                         
---------------+-----------------------------+----------------------------------------------------------
 id            | integer                     | not null default nextval('individuals_id_seq'::regclass)
 hap_id        | character varying(255)      | 
 population_id | integer                     | 
 sex           | character varying(255)      | 
Indexes:
    "individuals_pkey" PRIMARY KEY, btree (id)
    "index_individuals_on_hap_id" UNIQUE, btree (hap_id)

The bottleneck right now is looking up all of the genotypes for an individual and having them sorted by position. This is used frequently and is much more important than looking up individuals from a genotype. Examples of some of these queries are:

A simple lookup of all of an individual's genotypes:

SELECT *
FROM "genotypes"
INNER JOIN "genotypes_individuals"
        ON "genotypes".id = "genotypes_individuals".genotype_id
WHERE ("genotypes_individuals".individual_id = 2946)

Normally, though, this gets limited, because there are a lot of genotypes. We're often only interested in those on a specific chromosome:

SELECT *
FROM "genotypes"
INNER JOIN "genotypes_individuals"
        ON "genotypes".id = "genotypes_individuals".genotype_id
WHERE ("genotypes_individuals".individual_id = 2946)
  AND ("genotypes".ref_variation_id IN (37142, 37143, ...))

We also still need to occasionally go the other way:

SELECT *
FROM "individuals"
INNER JOIN "genotypes_individuals"
        ON "individuals".id = "genotypes_individuals".individual_id
WHERE ("genotypes_individuals".genotype_id = 53430)

Every time a new individual is added to the database, the join table grows by about 3 million rows. Intuitively, from a design perspective, this seems bad, because adding new individuals slows down every process that uses the existing data.

I understand that databases are designed to handle large tables efficiently, but we are already hitting bottlenecks due to drive IO. An individual query is still inconsequential, but thousands of them add up quickly. We can alleviate this problem somewhat by spreading the database across multiple drives. However, I wanted to see if there are any other alternatives out there. I have been wondering if it is somehow possible to segregate the join table entries by individual_id, so that lookups from individuals to genotypes would not be affected by adding more individual-genotype rows to the join table. Or do indices already do that?
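
On the last question: a B-tree index can only narrow a search efficiently on its leading column(s), and both indexes shown on genotypes_individuals lead with genotype_id, so neither is well suited to a filter on individual_id alone. A minimal sketch of an index led by individual_id follows. The index name is made up for illustration, and whether it helps on the real data is an assumption that would need to be checked against the query plan.

```sql
-- Hypothetical index (name invented here): individual_id comes first so that
-- "WHERE individual_id = ..." can descend the B-tree directly.
CREATE INDEX index_genotypes_individuals_on_individual_id_and_genotype_id
    ON genotypes_individuals (individual_id, genotype_id);

-- Inspect the plan for the per-individual lookup to see whether the
-- planner actually picks the new index.
EXPLAIN
SELECT g.*
FROM genotypes_individuals AS gi
JOIN genotypes AS g ON g.id = gi.genotype_id
WHERE gi.individual_id = 2946;
```

With such an index, the entries for one individual sit next to each other in the index, so adding rows for other individuals should not change how many index entries a single-individual lookup has to read. The matching table rows may still be scattered on disk, which is where the drive IO would show up.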


白首有我共你 2024-12-08 05:30:57

Have you looked at table partitioning?
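
As a rough illustration of what partitioning by individual could look like, here is a sketch that assumes PostgreSQL 11 or later (declarative hash partitioning). The table names and the partition count of 4 are invented for the example and are not taken from the question.

```sql
-- Hypothetical partitioned version of the join table.
-- The primary key includes the partition key (individual_id), which
-- PostgreSQL requires for unique constraints on partitioned tables.
CREATE TABLE genotypes_individuals_part (
    genotype_id   integer NOT NULL,
    individual_id integer NOT NULL,
    PRIMARY KEY (individual_id, genotype_id)
) PARTITION BY HASH (individual_id);

-- Four partitions is an arbitrary choice for the sketch.
CREATE TABLE genotypes_individuals_part_0 PARTITION OF genotypes_individuals_part
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE genotypes_individuals_part_1 PARTITION OF genotypes_individuals_part
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE genotypes_individuals_part_2 PARTITION OF genotypes_individuals_part
    FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE genotypes_individuals_part_3 PARTITION OF genotypes_individuals_part
    FOR VALUES WITH (MODULUS 4, REMAINDER 3);
```

A query that filters on individual_id = 2946 can then be pruned to a single partition, while a lookup by genotype_id alone has to visit every partition, which matches the stated priority of individual-to-genotype lookups over the reverse direction.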


听闻余生 2024-12-08 05:30:57

I would consider testing a schema that used natural keys instead of id numbers.

Your lookup of all of an individual's genotypes

SELECT * 
FROM "genotypes" 
INNER JOIN "genotypes_individuals" 
        ON "genotypes".id = "genotypes_individuals".genotype_id 
WHERE ("genotypes_individuals".individual_id = 2946 )

becomes

SELECT * 
FROM genotypes_individuals
WHERE (individual_id = 2946)

Sometimes that's faster. Sometimes it's not.

Switching to natural keys on our production system increased median performance by a factor of 10. Some queries ran 100 times faster with natural keys, because natural keys eliminated a lot of joins. Some queries ran slower, too. But the median speed-up was impressive anyway.
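
For the schema in the question, a natural-key join table might look roughly like the sketch below. This is an assumption about what such a schema could be, not the layout used on the production system mentioned above: the table name is invented, hap_id stands in for the individual's natural key, and ref_variation_id is kept as the marker identifier because the markers table is not shown.

```sql
-- Hypothetical natural-key table: each row carries the genotype value
-- itself, so reading an individual's genotypes needs no join to genotypes.
CREATE TABLE individual_genotypes (
    hap_id           varchar(255) NOT NULL,  -- natural key of the individual
    ref_variation_id integer      NOT NULL,  -- marker identifier
    value            varchar(255) NOT NULL,  -- genotype at that marker
    -- One genotype per individual per marker, as described in the question.
    PRIMARY KEY (hap_id, ref_variation_id)
);

-- Per-individual lookup without a join; 'SOME_HAP_ID' is a placeholder.
SELECT ref_variation_id, value
FROM individual_genotypes
WHERE hap_id = 'SOME_HAP_ID'
ORDER BY ref_variation_id;
```

The trade-off is wider rows, since the genotype value is repeated for every individual instead of being stored once in genotypes, so it is worth measuring both layouts on realistic data.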
