将平面文件加载到规范化的 MySQL 数据库中
将平面文件中的数据加载到 MySQL 数据库中,然后通过外键创建表之间的关系的最快方法是什么?
例如...我有一个格式为以下的平面文件:
[INDIVIDUAL] [POP] [MARKER] [GENOTYPE]
"INDIVIDUAL1", "CEU", "rs55555","AA"
"INDIVIDUAL1", "CEU", "rs535454","GA"
"INDIVIDUAL1", "CEU", "rs555566","AT"
"INDIVIDUAL1", "CEU", "rs12345","TT"
...
"INDIVIDUAL2", "JPT", "rs55555","AT"
我需要将其加载到四个表中:
IND (id,fk_pop,name)
POP (id,population)
MARKER (id,rsid)
GENOTYPE (id,fk_ind,fk_rsid,call)
具体来说,如何以可扩展的方式填充外键? 这些数字大约有 1000 多个个体,每个个体都有超过 100 万个基因型。
What is the fastest way to load data from flatfiles into a MySQL database, and then create the relations between the tables via foreign keys?
For example... I have a flat file in the format:
[INDIVIDUAL] [POP] [MARKER] [GENOTYPE]
"INDIVIDUAL1", "CEU", "rs55555","AA"
"INDIVIDUAL1", "CEU", "rs535454","GA"
"INDIVIDUAL1", "CEU", "rs555566","AT"
"INDIVIDUAL1", "CEU", "rs12345","TT"
...
"INDIVIDUAL2", "JPT", "rs55555","AT"
Which I need to load into four tables:
IND (id,fk_pop,name)
POP (id,population)
MARKER (id,rsid)
GENOTYPE (id,fk_ind,fk_rsid,call)
Specifically, how does one populate the foreign keys in a way that scales? The figures are in the range of 1000+ individuals, each with 1 million+ genotypes.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
您可以从没有外键的基表开始。 然后,您可以在向其他表中插入数据时查找 ID。
另一个想法是您可以用 GUID 替换平面文件中的 ID(INDIVIDUAL1、CEU,...等)。 然后直接使用它们作为 ID 和外键(我注意到这是标记的性能,这可能不会提供最佳的“性能”)。
You could to start with the base tables without foreign keys. You would then lookup the IDs as you insert data in the other tables.
Another idea is that you could replace the IDs in the flat file(INDIVIDUAL1,CEU, ...etc.) by GUIDs . and then use them directly as IDs and foreign keys (i noticed this is tagged performance, this may not give the best "performance").
我会采取多步骤方法来做到这一点。
I would take a multiple step approach to do this.
有一个更简单的方法。
首先,确保对那些应该有一个(名称、人口、rsid)的列有唯一约束。
然后使用如下所示的内容:
请注意,@ 用于指示变量,而不是列名。 在前 2 个 LOAD DATA 中,这些仅用于忽略数据。 在第二个2中,它们用于查找外键。
请注意,可能不会很快:)。
There is a simpler way.
First, make sure you have a UNIQUE constraint on those columns that should have one (name, population, rsid).
Then use something like the following:
Note where the @ is used to indicate variables, rather than column names. In the first 2 LOAD DATAs, these are just used to ignore data. In the second 2, they are used to look up the foreign keys.
Might not be very fast, mind :).