Most efficient way to bulk-load an unnormalized dataset into PostgreSQL?
I have loaded a huge CSV dataset -- Eclipse's Filtered Usage Data -- using PostgreSQL's COPY, and it's taking a huge amount of space because it's not normalized: three of the TEXT columns would be much more efficiently refactored into separate tables, referenced from the main table via foreign key columns.
My question is: is it faster to refactor the database after loading all the data, or to create the intended tables with all the constraints, and then load the data? The former involves repeatedly scanning a huge table (close to 10^9 rows), while the latter would involve doing multiple queries per CSV row (e.g. has this action type been seen before? If not, add it to the actions table, get its ID, create a row in the main table with the correct action ID, etc.).
Right now each refactoring step is taking roughly a day or so, and the initial loading also takes about the same time.
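For concreteness, one such refactoring step looks roughly like the following (raw_events, action, and actions are placeholder names, not my actual schema):

    -- Placeholder names: raw_events is the table loaded with COPY, action is one
    -- of the three TEXT columns being normalized out.
    CREATE TABLE actions (
        id   serial PRIMARY KEY,
        name text NOT NULL UNIQUE
    );

    INSERT INTO actions (name)
    SELECT DISTINCT action FROM raw_events;

    ALTER TABLE raw_events ADD COLUMN action_id integer;

    -- This UPDATE is the full-table rewrite that currently takes about a day.
    UPDATE raw_events r
    SET    action_id = a.id
    FROM   actions a
    WHERE  a.name = r.action;

    ALTER TABLE raw_events
        DROP COLUMN action,
        ADD CONSTRAINT raw_events_action_fk
            FOREIGN KEY (action_id) REFERENCES actions (id);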
1 Answer
From my experience, you want to get all the data you care about into a staging table in the database and go from there; after that, do as much set-based logic as you can, most likely via stored procedures. When you load into the staging table, don't have any indexes on it. Create the indexes after the data is loaded into the table.

Check this link out for some tips: http://www.postgresql.org/docs/9.0/interactive/populate.html
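A minimal sketch of that workflow, with made-up table and column names (staging, actions, events) since I don't know your real schema:

    -- 1. Load the raw CSV into a staging table with no indexes or constraints.
    --    (On 9.1+ an UNLOGGED table makes this step even cheaper.)
    CREATE TABLE staging (
        action      text,
        bundle      text,
        recorded_at timestamptz
    );
    COPY staging FROM '/path/to/usage_data.csv' WITH (FORMAT csv);

    -- 2. Build the lookup table in one set-based pass, not one query per CSV row.
    CREATE TABLE actions (
        id   serial PRIMARY KEY,
        name text NOT NULL UNIQUE
    );
    INSERT INTO actions (name)
    SELECT DISTINCT action FROM staging;

    -- 3. Populate the normalized main table with a single INSERT ... SELECT join.
    CREATE TABLE events (
        action_id   integer NOT NULL REFERENCES actions (id),
        bundle      text,
        recorded_at timestamptz
    );
    INSERT INTO events (action_id, bundle, recorded_at)
    SELECT a.id, s.bundle, s.recorded_at
    FROM   staging s
    JOIN   actions a ON a.name = s.action;

    -- 4. Create the indexes only after the data is in place.
    CREATE INDEX events_action_id_idx ON events (action_id);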