SQL database design for huge datasets
I have a customer with the following data structure: for each patient there may be multiple samples, and each sample may, after processing, contain 4 million data objects. The maximum number of samples per patient is 20, so a single patient may end up with 80 million rows of data, and there will eventually be many hundreds of patients.
In setting up a database to store the objects (each of which contains about 30 fields of statistics and measurements), the challenge is pretty clear: how do I manage this vast amount of data?
I was thinking of having one database with a table for each sample, so each table would have at most 4 million records.
A colleague of mine had an interesting suggestion, which was to take it one step further: create a new database per patient and then have a table per sample. His thinking was that having one log per patient, being able to move databases on a per-patient basis, and so on would be good. I can't disagree with him.
Is this reasonable? Is it a bad idea for some reason to have many databases?
Thoughts? Thank you!
Comments (2)
While the idea is interesting from a privacy and migration standpoint, it is NOT a good idea to have a single database per patient. Think about managing and backing up each patient database and its files. I'm not even sure a DBMS can handle millions of databases at the same time in one instance or on one server.
What I would do is accept the data volume as a fact of life and deal with it through the parameter types and tables you choose. Let the DBMS worry about the scale of it. Make sure you have a deployment model that lets you scale your tables up and out. A table per entity, at least, would be wise: one for patient, one for measurement, and so on.
Just do what you are good at as a developer and let the DBMS do what it was created for.
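To make the table-per-entity idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. Everything in it is an assumption for illustration: the table names (patient, sample, measurement), the column names (label, taken_on, object_no, stat_a, stat_b) and the two placeholder statistic columns standing in for the roughly 30 real fields. SQLite is used only because it runs anywhere; it is not a recommendation for this data volume.

```python
import sqlite3

# Hypothetical schema: all table and column names are illustrative.
SCHEMA = """
CREATE TABLE patient (
    patient_id INTEGER PRIMARY KEY,
    label      TEXT NOT NULL
);
CREATE TABLE sample (
    sample_id  INTEGER PRIMARY KEY,
    patient_id INTEGER NOT NULL REFERENCES patient(patient_id),
    taken_on   TEXT
);
-- One table for all data objects; the composite key keeps each
-- sample's rows together instead of using one table per sample.
CREATE TABLE measurement (
    sample_id  INTEGER NOT NULL REFERENCES sample(sample_id),
    object_no  INTEGER NOT NULL,
    stat_a     REAL,
    stat_b     REAL,
    PRIMARY KEY (sample_id, object_no)
) WITHOUT ROWID;
"""


def build_demo(conn):
    """Create the schema and load one patient with two small samples."""
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO patient VALUES (1, 'Patient1')")
    conn.executemany("INSERT INTO sample VALUES (?, 1, NULL)", [(1,), (2,)])
    rows = [(s, n, n * 0.5, n * 2.0) for s in (1, 2) for n in range(1000)]
    conn.executemany("INSERT INTO measurement VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def objects_per_sample(conn, patient_id):
    """Return (sample_id, object count) pairs for one patient."""
    cur = conn.execute(
        "SELECT s.sample_id, COUNT(*) FROM sample s "
        "JOIN measurement m ON m.sample_id = s.sample_id "
        "WHERE s.patient_id = ? GROUP BY s.sample_id ORDER BY s.sample_id",
        (patient_id,),
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
build_demo(conn)
print(objects_per_sample(conn, 1))
```

With this shape, adding a patient or a sample is an INSERT rather than a CREATE TABLE or CREATE DATABASE. On a server DBMS, partitioning the measurement table by sample or patient is probably worth evaluating as the place to recover the per-patient manageability your colleague is after.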
When working with that much data, you will definitely want to explore alternatives to MySQL and other RDBMSs. Have you looked into any NoSQL solutions (e.g. key-value stores)? There are several open-source solutions, some of which would immediately be ruled out for this application, given that any data loss is probably unacceptable.
Perhaps try looking at Apache's Cassandra http://cassandra.apache.org/. It's a distributed database system (key-value store), but it can run on a single node as well. It would allow you to store all of the data for each patient under a single key (e.g. "Patient1"), and from there you could organize the data into whatever key-value structure is best for querying in your application.
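As a rough illustration of that key layout, here is a toy in-memory model in plain Python. It is not Cassandra client code and uses no Cassandra API; the helper names (put, sample_objects) and the "area" field are made up for the example. The outer key plays the role of the per-patient key, and the inner key (sample id, object number) is one possible way to organize the data underneath it.

```python
from collections import defaultdict

# Toy model of a partitioned key-value layout; not Cassandra client code.
# Outer key: the patient. Inner key: (sample_id, object_no).
store = defaultdict(dict)


def put(patient_key, sample_id, object_no, fields):
    """Store one data object's fields under its patient key."""
    store[patient_key][(sample_id, object_no)] = fields


def sample_objects(patient_key, sample_id):
    """Return one sample's objects for a patient, ordered by object number."""
    part = store.get(patient_key, {})
    keys = sorted(k for k in part if k[0] == sample_id)
    return [part[k] for k in keys]


put("Patient1", 1, 0, {"area": 12.5})
put("Patient1", 1, 1, {"area": 9.0})
put("Patient1", 2, 0, {"area": 3.25})
put("Patient2", 1, 0, {"area": 7.0})

print(sample_objects("Patient1", 1))
```

With up to 80 million objects per patient, it may turn out that the sample id belongs in the outer key as well, so that each key holds at most one sample's 4 million objects; that depends on how the application queries the data.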