Optimizing storage of daily data in a relational database

Published 2024-09-11


Update: There was a comment that the question was not clear, and that I made a leap of logic in claiming I would have 118 billion rows. I have edited the text below to clarify things; see the italicized text below.

I have been struggling with this for a while now, have even gone down a few paths, but I turn now to the community for ideas. First, the problem — Store six daily variables for ~ 25 years for cells in a grid.

  • Number of vars = 6
  • Number of cells ~ 13 million
  • Number of days ~ 9125 (25 * 365)

Optimize the store for two different kinds of queries —

Query one: Retrieve the value of a single var for all or a portion of the cells for a single day. This is analogous to an image where every pixel is the value of a single var.

Query two: Retrieve the values for all the days, or for a range of days, for a single var for a single cell. This is like grabbing a column out of a table in which each row holds all the vars for a single day.

So, I set about designing the db. A single table, where every row is one day's values for one cell, would look like so

CREATE TABLE d (
    yr      SMALLINT,
    yday    SMALLINT,
    a       SMALLINT,
    b       SMALLINT,
    d       SMALLINT,
    e       SMALLINT,
    f       SMALLINT,
    g       SMALLINT,
    cell_id INTEGER
)
WITH (
    OIDS=FALSE
)
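With this single table, each of the two query patterns would want its own composite index; a sketch, with illustrative index names of my own (at this row count the indexes themselves become enormous, which is part of the problem):

```sql
-- Illustrative indexes for the two access patterns (names are my own):
CREATE INDEX d_day_idx  ON d (yr, yday);           -- query one: one day, many cells
CREATE INDEX d_cell_idx ON d (cell_id, yr, yday);  -- query two: one cell, a range of days
```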

The data would look like so

yr      yday    a   b   d   e   f   g   cell_id
------------------------------------------------
1980    1       x   x   x   x   x   x   1
1980    2       x   x   x   x   x   x   1
1980    3       x   x   x   x   x   x   1
..
1980    365     x   x   x   x   x   x   1
...
1981    1       x   x   x   x   x   x   1
1981    2       x   x   x   x   x   x   1
1981    3       x   x   x   x   x   x   1
..
1981    365     x   x   x   x   x   x   1

The problem: The above table would have 13 million × 9,125 ≈ 118 billion rows. Huge indexes, slow queries, major problems loading the data, and so on.

So, I partitioned the table into years like so

CREATE TABLE d_<yyyy> (
    CHECK ( yr = <yyyy> )
) INHERITS (d)
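As an aside, on PostgreSQL 10 and later the same layout can be expressed with declarative partitioning instead of CHECK constraints plus inheritance; a sketch, assuming a current server (column list abridged):

```sql
-- Declarative range partitioning by year (PostgreSQL 10+); columns abridged.
CREATE TABLE d (
    yr      SMALLINT,
    yday    SMALLINT,
    a       SMALLINT,    -- remaining var columns as in the table above
    cell_id INTEGER
) PARTITION BY RANGE (yr);

-- One partition per year, e.g.:
CREATE TABLE d_1980 PARTITION OF d FOR VALUES FROM (1980) TO (1981);
```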

Hmmm... still no satisfaction. I ended up with 1 + 25 tables, but each of the year tables now had ~ 4.75 billion rows, and the queries were still very slow.

So, I partitioned it all by years and days like so

CREATE TABLE d_<yyyy>_<yday> (
    CHECK ( yr = <yyyy> AND yday = <yday> )
) INHERITS (d)

Each table now has 13 million rows and is reasonably fast (although still not satisfactorily fast), but now I have 9K tables. That has its own problems: I can't query the master table anymore, as Pg tries to lock all the tables and runs out of memory. Additionally, I can no longer run query two above. I could do something like

SELECT a FROM d_1980_1 WHERE cell_id = 1 
UNION ALL 
SELECT a FROM d_1980_2 WHERE cell_id = 1 
UNION ALL 
SELECT a FROM d_1980_3 WHERE cell_id = 1 
UNION ALL 
SELECT a FROM d_1980_4 WHERE cell_id = 1 
UNION ALL 
...

But the above is hardly optimal. (Note that it must be UNION ALL, not UNION: plain UNION would deduplicate identical values coming from different days.)
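Rather than writing the 365 arms of that query by hand, the query text can at least be generated; a sketch using string_agg and format, which assumes a server of 9.1 or later and that partitions d_1980_1 through d_1980_365 exist:

```sql
-- Generate the per-day UNION ALL query for one year instead of typing it out:
SELECT string_agg(
           format('SELECT a FROM d_1980_%s WHERE cell_id = 1', yday),
           E'\nUNION ALL\n')
FROM generate_series(1, 365) AS yday;
```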

Any suggestions, ideas, or brainstorms would be appreciated. Perhaps Pg, or even an RDBMS, is not the right tool for this problem, in which case suggestions for alternatives would be welcome as well.


Answer by 温柔戏命师, 2024-09-18:


Without resorting to massive indexing or duplication of data, I think it will be difficult to find a single schema design that is optimal for both of your queries.

By clustering your data by either date or sensor, retrieval by one of these conditions can be made to run fast, but not both at the same time.

Assuming that access by date is the most important, you could lay out your table like below:

CREATE TABLE d (
    day      DATE,
    a       SMALLINT[],
    b       SMALLINT[],
    ...
);

Observe that there is now only one row per day, and that the cell fields have become arrays, where each cell has its own array index. In case the cell numbering is not zero-based, a lookup table could provide the mapping from cell ids to array indexes.
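Such a mapping could be a small lookup table; a sketch, with hypothetical names of my own:

```sql
-- Hypothetical lookup from grid cell ids to 1-based array positions:
CREATE TABLE cell_map (
    cell_id INTEGER PRIMARY KEY,
    arr_idx INTEGER NOT NULL UNIQUE
);

-- Query two for a given cell then reads the mapped element, e.g.:
-- SELECT d.a[m.arr_idx]
-- FROM d JOIN cell_map m ON m.cell_id = 1000
-- WHERE d.day BETWEEN '1981-01-01' AND '1981-12-31';
```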

Query 1,

Retrieve the value of a single var for
all or a portion of the cells for a
single day.

is accomplished by, for example,

SELECT a FROM d WHERE day = '1981-01-01'

Query 2,

Retrieve values for all the days or a duration of days for a single var for a single cell.

will be of the form

SELECT a[1000] FROM d WHERE day BETWEEN '1981-01-01' AND '1981-12-31'

I believe large arrays in PostgreSQL are accessed without actually loading the whole data structure. I know BLOBs are. If that is the case, this solution may be fast enough for you. Otherwise I would suggest making another view of the data, which optimizes access by cell.
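That second, cell-clustered view of the data could look like the mirror image of the table above; a sketch with an illustrative table name (this duplicates the data, as noted, and each array would hold ~9,125 elements):

```sql
-- One row per cell; each array holds one element per day (columns abridged):
CREATE TABLE d_by_cell (
    cell_id INTEGER PRIMARY KEY,
    a       SMALLINT[],   -- a[1] = day 1, a[2] = day 2, ...
    b       SMALLINT[]
);

-- Query two then touches a single row, e.g. the first year of var a:
-- SELECT a[1:365] FROM d_by_cell WHERE cell_id = 1000;
```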
