Filling a PostgreSQL database with a large amount of data
I have a PostgreSQL database with a certain structure, and I have several million XML files. I have to parse each file, extract certain data, and fill the tables in the database. What I want to know is the best language/framework/algorithm for this routine.
I wrote a program in C# (Mono) using the DbLinq ORM. It does not use threading; it just parses the files one by one, fills the table objects, and submits the objects to the database in groups (of 200, for example). It turns out to be rather slow: it processes about 400 files per minute and will take about a month to finish the job.
I'm asking for your thoughts and tips.
2 Answers
I think it would be faster if you use small programs in a pipe that will:
join your files into one big stream;
parse the input stream and generate an output stream in PostgreSQL COPY format, the same format pg_dump uses when creating backups; it is similar to tab-separated values and looks like this:
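The sample below is purely illustrative (the table contents are invented): one row per line, columns separated by tab characters, and \N standing for NULL.
    1	foo	2011-04-20	\N
    2	bar	2011-04-21	42.5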
For example on Linux:
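A sketch of such a pipeline, assuming a hypothetical xml_to_copy filter that turns the concatenated XML on its stdin into COPY-format rows, plus made-up table and database names:
    find /path/to/xml -name '*.xml' -print0 \
        | xargs -0 cat \
        | ./xml_to_copy \
        | psql -d mydb -c "COPY mytable FROM STDIN"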
Using COPY is much faster than inserting through an ORM. Joining the files will let reading the files and writing to the database run in parallel. Disabling "fsync" will allow a big speedup, but will require restoring the database from a backup if the server crashes during loading.
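For reference, fsync is a server setting in postgresql.conf; the only change involved is the single line below, with the recovery caveat described above.
    # postgresql.conf: turn off only for the bulk load, then re-enable
    fsync = off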
Generally I believe Perl is a good option for parsing tasks, though I don't know Perl myself. It sounds to me like your performance demands are so extreme that you might need to write your own XML parser, since the performance of a standard one could become the bottleneck (you should test this before you start implementing). I myself use Python and psycopg2 to communicate with Postgres.
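As a rough sketch of the parsing side (the element and column names here are invented for illustration), Python's standard xml.etree.ElementTree.iterparse can stream each file and emit one tab-separated line per record:
    import sys
    import xml.etree.ElementTree as ET

    # Hypothetical record layout: <record><id/><name/><created/></record>.
    # Real data would also need tab/backslash escaping for COPY.
    def emit_rows(path):
        for event, elem in ET.iterparse(path, events=("end",)):
            if elem.tag == "record":
                sys.stdout.write("%s\t%s\t%s\n" % (
                    elem.findtext("id", ""),
                    elem.findtext("name", ""),
                    elem.findtext("created", ""),
                ))
                elem.clear()  # release parsed nodes to keep memory flat

    for path in sys.argv[1:]:
        emit_rows(path)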
Whichever language you choose, you certainly want to use COPY FROM, probably feeding the data to Postgres over stdin from Perl/Python/whatever other language you pick.
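A minimal Python sketch of that idea, assuming psycopg2 and made-up database/table names; it streams rows that are already in tab-separated COPY format from stdin into the table:
    import sys
    import psycopg2

    # Assumed connection string and table name, for illustration only.
    conn = psycopg2.connect("dbname=mydb")
    cur = conn.cursor()
    # stdin carries tab-separated rows, e.g. produced by the XML-parsing step.
    cur.copy_expert("COPY mytable FROM STDIN", sys.stdin)
    conn.commit()
    conn.close()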
Instead of spending a lot of time optimizing everything, you could also take a suboptimal solution and run it massively in parallel on, say, 100 EC2 instances. That would be a lot cheaper than spending hours and hours looking for the optimal solution.
Without knowing anything about the size of the files, 400 files per minute does not sound too bad. Ask yourself whether it is worth spending a week of development to cut the time to a third, or just running it now and waiting a month.