稀疏数据:RDBMS 中的高效存储和检索

发布于 2024-07-11 09:52:24 字数 1124 浏览 6 评论 0 原文

我有一个表,表示跨项目修订的源文件指标值,如下所示:(

Revision FileA FileB FileC FileD FileE ...
1           45     3    12   123   124
2           45     3    12   123   124
3           45     3    12   123   124
4           48     3    12   123   124
5           48     3    12   123   124
6           48     3    12   123   124
7           48    15    12   123   124

上述数据的关系视图不同。每行包含以下列:修订、FileId、值。计算得出的数据存储在 Subversion 存储库中,因此我们尝试以关系模式表示存储库的结构。)

10000 个修订版中最多可以有 23750 个文件(ImageMagick 绘图程序)。 正如您所看到的,连续修订之间的大多数值都是相同的,因此该表的有用数据非常稀疏。 我正在寻找一种存储数据的方法,

  • 避免复制并有效地使用空间(目前,非稀疏表示需要 260 GB(数据+索引)来存储我想要存储的不到 10% 的数据),
  • 使我能够高效地检索使用 SQL 查询(无需显式循环修订或文件)特定修订的值
  • 使我能够有效地检索特定指标值的修订。

理想情况下,解决方案不应依赖于特定的 RDBMS 并且应与 休眠。 如果这是不可能的,我可以接受使用 Hibernate、MySQL 或 PostgreSQL 特定的功能。

I have a table representing values of source file metrics across project revisions, like the following:

Revision FileA FileB FileC FileD FileE ...
1           45     3    12   123   124
2           45     3    12   123   124
3           45     3    12   123   124
4           48     3    12   123   124
5           48     3    12   123   124
6           48     3    12   123   124
7           48    15    12   123   124

(The relational view of the above data is different. Each row contains the following columns: Revision, FileId, Value. The files and their revisions from which the data is calculated are stored in Subversion repositories, so we're trying to represent the repository's structure in a relational schema.)

There can be up to 23750 files in 10000 revisions (this is the case for the ImageMagick drawing program). As you can see, most values are the same between successive revisions, so the table's useful data is quite sparse. I am looking for a way to store the data that

  • avoids replication and uses space efficiently (currently the non-sparse representation requires 260 GB (data+index) for less than 10% of the data I want to store)
  • allows me to retrieve efficiently the values for a specific revision using an SQL query (without explicitly looping through revisions or files)
  • allows me to retrieve efficiently the revision for a specific metric value.

Ideally, the solution should not depend on a particular RDBMS and should be compatible with Hibernate. If this is not possible, I can live with using Hibernate, MySQL or PostgreSQL-specific features.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

各自安好 2024-07-18 09:52:24

这就是我建模的方式。 我省略了修订表和文件表,因为它们应该是不言自明的。

CREATE TABLE Revision_Files
(
    start_revision_number   INT NOT NULL,
    end_revision_number     INT NOT NULL,
    file_number             INT NOT NULL,
    value                   INT NOT NULL,
    CONSTRAINT PK_Revision_Files PRIMARY KEY CLUSTERED (start_revision_number, file_number),
    CONSTRAINT CHK_Revision_Files_start_before_end CHECK (start_revision_number <= end_revision_number)
)
GO

要获取特定修订版本的文件的所有值,您可以使用以下查询。 使用外部联接联接到文件表将使您获得那些没有为该修订版本定义值的文件表。

SELECT
    REV.revision_number,
    RF.file_number,
    RF.value
FROM
    Revisions REV
INNER JOIN Revision_Files RF ON
    RF.start_revision_number <= REV.revision_number AND
    RF.end_revision_number >= REV.revision_number
GO

假设我正确理解您在第三点中想要的内容,这将使您获得特定文件具有特定值的所有修订:

SELECT
    REV.revision_number
FROM
    Revision_Files RF
INNER JOIN Revisions REV ON
    REV.revision_number BETWEEN RF.start_revision_number AND RF.end_revision_number
WHERE
    RF.file_number = @file_number AND
    RF.value = @value
GO

This is how I might model it. I've left out the Revisions table and Files table as those should be pretty self-explanatory.

CREATE TABLE Revision_Files
(
    start_revision_number   INT NOT NULL,
    end_revision_number     INT NOT NULL,
    file_number             INT NOT NULL,
    value                   INT NOT NULL,
    CONSTRAINT PK_Revision_Files PRIMARY KEY CLUSTERED (start_revision_number, file_number),
    CONSTRAINT CHK_Revision_Files_start_before_end CHECK (start_revision_number <= end_revision_number)
)
GO

To get all of the values for files of a particular revision you could use the following query. Joining to the files table with an outer join would let you get those that have no defined value for that revision.

SELECT
    REV.revision_number,
    RF.file_number,
    RF.value
FROM
    Revisions REV
INNER JOIN Revision_Files RF ON
    RF.start_revision_number <= REV.revision_number AND
    RF.end_revision_number >= REV.revision_number
GO

Assuming that I understand correctly what you want in your third point, this will let you get all of the revisions for which a particular file has a certain value:

SELECT
    REV.revision_number
FROM
    Revision_Files RF
INNER JOIN Revisions REV ON
    REV.revision_number BETWEEN RF.start_revision_number AND RF.end_revision_number
WHERE
    RF.file_number = @file_number AND
    RF.value = @value
GO
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文