我有一个表,表示跨项目修订的源文件指标值,如下所示:(
Revision FileA FileB FileC FileD FileE ...
1 45 3 12 123 124
2 45 3 12 123 124
3 45 3 12 123 124
4 48 3 12 123 124
5 48 3 12 123 124
6 48 3 12 123 124
7 48 15 12 123 124
上述数据的关系视图不同。每行包含以下列:修订、FileId、值。计算得出的数据存储在 Subversion 存储库中,因此我们尝试以关系模式表示存储库的结构。)
10000 个修订版中最多可以有 23750 个文件(ImageMagick 绘图程序)。 正如您所看到的,连续修订之间的大多数值都是相同的,因此该表的有用数据非常稀疏。 我正在寻找一种存储数据的方法,
- 避免复制并有效地使用空间(目前,非稀疏表示需要 260 GB(数据+索引)来存储我想要存储的不到 10% 的数据),
- 使我能够高效地检索使用 SQL 查询(无需显式循环修订或文件)特定修订的值
- 使我能够有效地检索特定指标值的修订。
理想情况下,解决方案不应依赖于特定的 RDBMS 并且应与 休眠。 如果这是不可能的,我可以接受使用 Hibernate、MySQL 或 PostgreSQL 特定的功能。
I have a table representing values of source file metrics across project revisions, like the following:
Revision FileA FileB FileC FileD FileE ...
1 45 3 12 123 124
2 45 3 12 123 124
3 45 3 12 123 124
4 48 3 12 123 124
5 48 3 12 123 124
6 48 3 12 123 124
7 48 15 12 123 124
(The relational view of the above data is different. Each row contains the following columns: Revision, FileId, Value. The files and their revisions from which the data is calculated are stored in Subversion repositories, so we're trying to represent the repository's structure in a relational schema.)
There can be up to 23750 files in 10000 revisions (this is the case for the ImageMagick drawing program). As you can see, most values are the same between successive revisions, so the table's useful data is quite sparse. I am looking for a way to store the data that
- avoids replication and uses space efficiently (currently the non-sparse representation requires 260 GB (data+index) for less than 10% of the data I want to store)
- allows me to retrieve efficiently the values for a specific revision using an SQL query (without explicitly looping through revisions or files)
- allows me to retrieve efficiently the revision for a specific metric value.
Ideally, the solution should not depend on a particular RDBMS and should be compatible with Hibernate. If this is not possible, I can live with using Hibernate, MySQL or PostgreSQL-specific features.
发布评论
评论(1)
这就是我建模的方式。 我省略了修订表和文件表,因为它们应该是不言自明的。
要获取特定修订版本的文件的所有值,您可以使用以下查询。 使用外部联接联接到文件表将使您获得那些没有为该修订版本定义值的文件表。
假设我正确理解您在第三点中想要的内容,这将使您获得特定文件具有特定值的所有修订:
This is how I might model it. I've left out the Revisions table and Files table as those should be pretty self-explanatory.
To get all of the values for files of a particular revision you could use the following query. Joining to the files table with an outer join would let you get those that have no defined value for that revision.
Assuming that I understand correctly what you want in your third point, this will let you get all of the revisions for which a particular file has a certain value: