Using Django and MySQL to store and look up large DNA microarray results
I'm trying to set up a Django app that allows me to store and look up the results of a DNA microarray with ~500k unique probes for a large number of subjects.
The model setup I've been toying with is as follows:
from django.db import models

# Note: CharField requires max_length; the values below are placeholders.
class Subject(models.Model):
    name = models.CharField(max_length=255)

class Chip(models.Model):
    chip_name = models.CharField(max_length=255)

class Probe(models.Model):
    chips = models.ManyToManyField(Chip, related_name="probes")
    rs_name = models.CharField(max_length=255, unique=True)
    chromosome = models.IntegerField()
    location = models.IntegerField()

class Genotype(models.Model):
    probe = models.ForeignKey(Probe, related_name='genotypes', on_delete=models.CASCADE)
    subject = models.ForeignKey(Subject, related_name='genotypes', on_delete=models.CASCADE)
    genotype = models.CharField(max_length=10)
I was wondering if there's a better way to set this up? I was just thinking that for each subject I would be creating 500k rows in the Genotype table.
If I'm using a MySQL database, will it be able to handle a large number of subjects, each adding 500k rows to that table?
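For a sense of scale, loading a single subject's results with these models means bulk-inserting on the order of 500k Genotype rows. A minimal sketch, assuming calls is an iterable of (Probe, genotype_string) pairs parsed from the array output and an arbitrary batch_size:

# Sketch: insert one subject's ~500k genotype calls in batches.
# `calls` is assumed to be an iterable of (Probe, genotype_string) pairs;
# batch_size is an arbitrary choice.
Genotype.objects.bulk_create(
    (Genotype(subject=subject, probe=probe, genotype=call)
     for probe, call in calls),
    batch_size=10000,
)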
1 Answer
Well, if you need a result (genotype) per Probe for every Subject, then a standard many-to-many intermediary table (Genotype) is indeed going to get huge.
With 1000 Subjects you'd have 500 million records.
If you could save the values for the genotype field encoded/serialized in one or more columns, that would reduce the number of records drastically. Saving 500k results encoded in a single column would be a problem, but if you can split them into groups it should be workable. This would reduce the number of records to the number of Subjects. Another possibility would be to group Probe-s into ProbeGroup-s, so that nr. of ProbeResults = nr. of Subjects * nr. of ProbeGroups. The first option would be something like this:
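A minimal sketch of what that model could look like, with one TextField per probe group; the SubjectGenotypes name, the packed_genotypes related_name, and the number of pg_*_genotypes columns are illustrative assumptions:

from django.db import models

class SubjectGenotypes(models.Model):
    # One row per Subject; each pg_*_genotypes column stores the serialized
    # results for one group of probes ("Group A", "Group B", "Group C", ...).
    subject = models.OneToOneField(Subject, related_name='packed_genotypes',
                                   on_delete=models.CASCADE)
    pg_a_genotypes = models.TextField()
    pg_b_genotypes = models.TextField()
    pg_c_genotypes = models.TextField()
    # ... one column per probe group, however many groups you end up with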
This will of course make it more difficult to search/filter results, but it shouldn't be too hard if the saved format is simple.
You could use the following format in the genotype columns: "probe1_id|genotype1,probe2_id|genotype2,probe3_id|genotype3,..."
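Packing one probe group's calls into that format could look roughly like this (group_calls is an assumed iterable of (probe_id, genotype) pairs, and the SubjectGenotypes row comes from the sketch above):

# Sketch: serialize one probe group's calls into its column.
packed = ','.join('%s|%s' % (probe_id, genotype)
                  for probe_id, genotype in group_calls)
row = SubjectGenotypes.objects.get(subject=subject)
row.pg_c_genotypes = packed   # the column for this probe's group
row.save()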
To retrieve a queryset of Subjects for a specific genotype + probe (a sketch follows after these steps):
a. Determine which group the probe belongs to, e.g. "Group C" -> pg_c_genotypes
b. Query the respective column for the probe_id + genotype combination.
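A rough version of those two steps, assuming the SubjectGenotypes sketch above (the group_for_probe helper is hypothetical):

def subjects_with_genotype(probe_id, genotype):
    # (a) map the probe to its group's column, e.g. "Group C" -> "pg_c_genotypes"
    group_letter = group_for_probe(probe_id)   # hypothetical helper
    column = 'pg_%s_genotypes' % group_letter
    # (b) substring-match the "probe_id|genotype" token in that column;
    # a real implementation needs delimiters (e.g. ",123|AA,") so probe 123
    # doesn't match inside probe 1123.
    token = '%s|%s' % (probe_id, genotype)
    lookup = {'packed_genotypes__%s__contains' % column: token}
    return Subject.objects.filter(**lookup)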
The other option that I've mentioned is to also have a ProbeGroup model, with each Probe having a ForeignKey to ProbeGroup (see the sketch below). You can then query the genotypes field the same way, except now you can query the group directly instead of determining which column you need to search.
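A sketch of that second option, reusing the SubjectProbeResults name mentioned below; the field names and related_names are assumptions:

from django.db import models

class ProbeGroup(models.Model):
    name = models.CharField(max_length=50)

class Probe(models.Model):
    group = models.ForeignKey(ProbeGroup, related_name='probes',
                              on_delete=models.CASCADE)
    # ... plus the existing rs_name / chromosome / location / chips fields

class SubjectProbeResults(models.Model):
    subject = models.ForeignKey(Subject, related_name='probe_results',
                                on_delete=models.CASCADE)
    probe_group = models.ForeignKey(ProbeGroup, related_name='subject_results',
                                    on_delete=models.CASCADE)
    # serialized "probe_id|genotype" pairs for every probe in probe_group
    genotypes = models.TextField()

A lookup can then filter on the group directly, e.g. Subject.objects.filter(probe_results__probe_group=group, probe_results__genotypes__contains=token).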
This way, if you have for example 1000 probes per group -> 500 groups, then for 1000 Subjects you'll have 500K SubjectProbeResults rows: still a lot, but certainly more manageable than 500M. But you could have fewer groups; you'd have to test what works best.