Using Django and MySQL to store and look up large DNA microarray results
I'm trying to set up a Django app that allows me to store and look up the results of a DNA microarray with ~500k unique probes for a large number of subjects.
The model setup I've been toying with is as follows:
from django.db import models

# Note: CharField requires max_length; the values below are placeholders.
class Subject(models.Model):
    name = models.CharField(max_length=255)

class Chip(models.Model):
    chip_name = models.CharField(max_length=255)

class Probe(models.Model):
    chips = models.ManyToManyField(Chip, related_name="probes")
    rs_name = models.CharField(max_length=255, unique=True)
    chromosome = models.IntegerField()
    location = models.IntegerField()

class Genotype(models.Model):
    probe = models.ForeignKey(Probe, related_name='genotypes', on_delete=models.CASCADE)
    subject = models.ForeignKey(Subject, related_name='genotypes', on_delete=models.CASCADE)
    genotype = models.CharField(max_length=10)
I was wondering if there's a better way to set this up? I was just thinking that for each subject I would be creating 500k rows in the Genotype table.
If I'm using a MySQL database, will it be able to handle a large number of subjects, each adding 500k rows to that table?
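For a sense of scale, loading a single subject's results with these models means bulk-inserting on the order of 500k Genotype rows. A minimal sketch, assuming calls is an iterable of (Probe, genotype_string) pairs parsed from the array output and an arbitrary batch_size:

# Sketch: insert one subject's ~500k genotype calls in batches.
# `calls` is assumed to be an iterable of (Probe, genotype_string) pairs;
# batch_size is an arbitrary choice.
Genotype.objects.bulk_create(
    (Genotype(subject=subject, probe=probe, genotype=call)
     for probe, call in calls),
    batch_size=10000,
)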
1 Answer
Well, if you need a result (genotype) per Probe for every Subject, then a standard many-to-many intermediary table (Genotype) is indeed going to get huge.
With 1000 Subjects you'd have 500 million records.
If you could save the values for the genotype field encoded/serialized in one or more columns, that would reduce the number of records drastically. Saving 500k results encoded in a single column would be a problem, but if you can split them into groups it should be workable. This would reduce the number of records to the number of Subjects. Another possibility would be to group Probe-s into ProbeGroup-s, so that nr. of ProbeResults = nr. of Subjects * nr. of ProbeGroups. The first option would be something like this:
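A minimal sketch of what that model could look like, with one TextField per probe group; the SubjectGenotypes name, the packed_genotypes related_name, and the number of pg_*_genotypes columns are illustrative assumptions:

from django.db import models

class SubjectGenotypes(models.Model):
    # One row per Subject; each pg_*_genotypes column stores the serialized
    # results for one group of probes ("Group A", "Group B", "Group C", ...).
    subject = models.OneToOneField(Subject, related_name='packed_genotypes',
                                   on_delete=models.CASCADE)
    pg_a_genotypes = models.TextField()
    pg_b_genotypes = models.TextField()
    pg_c_genotypes = models.TextField()
    # ... one column per probe group, however many groups you end up with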
This will of course make it more difficult to search/filter results, but it shouldn't be too hard if the saved format is simple.
You could use the following format in the genotype columns: "probe1_id|genotype1,probe2_id|genotype2,probe3_id|genotype3,..."
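Packing one probe group's calls into that format could look roughly like this (group_calls is an assumed iterable of (probe_id, genotype) pairs, and the SubjectGenotypes row comes from the sketch above):

# Sketch: serialize one probe group's calls into its column.
packed = ','.join('%s|%s' % (probe_id, genotype)
                  for probe_id, genotype in group_calls)
row = SubjectGenotypes.objects.get(subject=subject)
row.pg_c_genotypes = packed   # the column for this probe's group
row.save()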
To retrieve a queryset of Subjects for a specific genotype + probe (a sketch follows after these steps):
a. Determine which group the probe belongs to, e.g. "Group C" -> pg_c_genotypes
b. Query the respective column for the probe_id + genotype combination.
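A rough version of those two steps, assuming the SubjectGenotypes sketch above (the group_for_probe helper is hypothetical):

def subjects_with_genotype(probe_id, genotype):
    # (a) map the probe to its group's column, e.g. "Group C" -> "pg_c_genotypes"
    group_letter = group_for_probe(probe_id)   # hypothetical helper
    column = 'pg_%s_genotypes' % group_letter
    # (b) substring-match the "probe_id|genotype" token in that column;
    # a real implementation needs delimiters (e.g. ",123|AA,") so probe 123
    # doesn't match inside probe 1123.
    token = '%s|%s' % (probe_id, genotype)
    lookup = {'packed_genotypes__%s__contains' % column: token}
    return Subject.objects.filter(**lookup)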
The other option that I've mentioned is to also have a ProbeGroup model, with each Probe having a ForeignKey to ProbeGroup (see the sketch below). You can then query the genotypes field the same way, except now you can query the group directly instead of determining which column you need to search.
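A sketch of that second option, reusing the SubjectProbeResults name mentioned below; the field names and related_names are assumptions:

from django.db import models

class ProbeGroup(models.Model):
    name = models.CharField(max_length=50)

class Probe(models.Model):
    group = models.ForeignKey(ProbeGroup, related_name='probes',
                              on_delete=models.CASCADE)
    # ... plus the existing rs_name / chromosome / location / chips fields

class SubjectProbeResults(models.Model):
    subject = models.ForeignKey(Subject, related_name='probe_results',
                                on_delete=models.CASCADE)
    probe_group = models.ForeignKey(ProbeGroup, related_name='subject_results',
                                    on_delete=models.CASCADE)
    # serialized "probe_id|genotype" pairs for every probe in probe_group
    genotypes = models.TextField()

A lookup can then filter on the group directly, e.g. Subject.objects.filter(probe_results__probe_group=group, probe_results__genotypes__contains=token).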
This way, if you have for example 1000 probes per group -> 500 groups, then for 1000 Subjects you'll have 500K SubjectProbeResults rows: still a lot, but certainly more manageable than 500M. But you could have fewer groups; you'd have to test what works best.