Is it possible to have an SQL table with over a million columns?
I am building a database for microarray data. Each patient sample has over 1,000,000 features and I would like to store the patient samples as rows in an SQL table with each feature as a column.
HuEX Microarray Data
+----+----------+----------+-----+------------------+
| ID | Feature1 | Feature2 | ... | Feature1,000,000 |
+----+----------+----------+-----+------------------+
| 1 | 2.3543 | 10.5454 | ... | 5.34333 |
| 2 | 13.4312 | 1.3432 | ... | 40.23422 |
+----+----------+----------+-----+------------------+
I know most relational database systems have limits on the number of columns in a table.
+------------+-------------------------------+
| DBMS       | Max columns per table         |
+------------+-------------------------------+
| SQL Server | 1,024 - 30,000                |
| MySQL      | 4,096 (65,535-byte row limit) |
| PostgreSQL | 250 - 1,600                   |
| Oracle     | 1,000                         |
+------------+-------------------------------+
Obviously these limits are too low for my task. Is there any way to increase the number of columns an SQL database table can have, or is there another DBMS that can handle such a high number of table columns?
Update
Note all the columns will have values for all the rows.
5 Answers
Don't.
Even if you can make it work, it will be very slow and unwieldy.
Instead, you should make a separate table with PatientID, Feature, and Value columns. This table would have one row for each cell in your proposed table.
It also makes it possible to add additional information about each patient-feature pair.
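A minimal sketch of that design (the table name PatientFeature and the exact types are my own choices):

CREATE TABLE PatientFeature (
    PatientID INT              NOT NULL,  -- the patient sample (a row in the original design)
    Feature   INT              NOT NULL,  -- feature number, 1 .. 1,000,000
    Value     DOUBLE PRECISION NOT NULL,  -- the measured value for that feature
    PRIMARY KEY (PatientID, Feature)
);

-- one row per cell of the proposed million-column table, e.g.:
INSERT INTO PatientFeature (PatientID, Feature, Value)
VALUES (1, 1, 2.3543),
       (1, 2, 10.5454),
       (2, 1, 13.4312);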
You'd normally split (normalize) the tables:
SQL databases can't handle a lot of columns, but they can handle a lot of rows.
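For instance, reusing the hypothetical PatientFeature table sketched above, reading one feature for one patient is a single indexed row lookup (feature 12345 is just an illustrative number):

SELECT Value
FROM   PatientFeature
WHERE  PatientID = 1
  AND  Feature   = 12345;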
Try rearranging your table so that each feature of each sample becomes its own row.
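A sketch of that arrangement, reusing the sample values from the question (the column names are an assumption):

+----+---------+---------+
| ID | Feature | Value   |
+----+---------+---------+
| 1  | 1       | 2.3543  |
| 1  | 2       | 10.5454 |
| 2  | 1       | 13.4312 |
| 2  | 2       | 1.3432  |
+----+---------+---------+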
This is actually a use case for an Entity-Attribute-Value (EAV) model, and may actually be better suited to non-RDBMS/SQL solutions in some intense environments. (Relational databases are workhorses though ... might as well use one until it's provably not sufficient ;-)
From the Wikipedia article:
Happy coding.
Well, going with the new information that this is a dense array of homogeneous numeric (double) values and that querying is important (that is, I will disregard de-normalization into blobs/XML and the use of special UDFs), I propose the following:
Split each result into multiple records, where each record carries the ID, a SEGMENT number, and a fixed count q of consecutive feature values.
The value of q is arbitrary, but should be chosen based upon the particular database implementation (e.g. try to fit into the 8k record size in SQL Server) for performance/efficiency reasons.
Each result will then be split into records such that SEGMENT refers to the segment. That is, the "absolute index" of a given feature is n = SEGMENT * q + x, and feature n will be found in the record where SEGMENT = n / q. It then follows that the primary key is (ID, SEGMENT).
Thus querying is still easy -- the only change is the conversion to/from the segment -- with the only additional requirement being the SEGMENT column (which may also participate in an index).
(A separate table can be used to map features to SEGMENT/x, or otherwise. In this way it is similar to an EAV model.)
Thus, while similar in some ways to a fully normalized form, it takes advantage of the packed/homogeneous/static-feature nature of the initial matrix to significantly reduce the number of records -- while 2 million records is an arguably small table and 20 million records is only a "mid-sized" table, 200 million records (the result of 200 chips x 1 million features per chip, if each feature gets its own record) starts to become daunting. With the same query complexity, a q of 200 would reduce the number of records to a mere 1 million. (Each compacted record is also much more efficient in terms of data/structure ratio.)
Happy coding.
While the above is one tentative "what if" suggestion on my part, I would encourage exploring the problem more -- in particular, the exact data-access patterns required. I'm not sure this is a "typical" usage of a standard RDBMS and an RDBMS may not even be a good way to approach this problem.
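For concreteness, here is a small sketch of the segmented layout described above, using an assumed q of 4 to keep it readable (in practice q would be chosen much larger, to fill the record size); the table and column names are my own invention:

CREATE TABLE ResultSegment (
    ID      INT              NOT NULL,  -- patient sample
    SEGMENT INT              NOT NULL,  -- which block of q consecutive features
    V0      DOUBLE PRECISION NOT NULL,  -- feature value at x = 0 within the block
    V1      DOUBLE PRECISION NOT NULL,
    V2      DOUBLE PRECISION NOT NULL,
    V3      DOUBLE PRECISION NOT NULL,  -- ... up to V(q-1)
    PRIMARY KEY (ID, SEGMENT)
);

-- With q = 4, feature n = 10 has SEGMENT = 10 / 4 = 2 and x = 10 % 4 = 2,
-- so it is read from column V2 of the record (ID, SEGMENT = 2):
SELECT V2 FROM ResultSegment WHERE ID = 1 AND SEGMENT = 2;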