Django with a huge MySQL database
What would be the best way to import multi-million-record CSV files into Django?
Currently I'm using the Python csv module; it takes 2-4 days to process a 1-million-record file. It does some checking of whether each record already exists, plus a few other checks.
Can this process be made to finish in a few hours?
Could memcache be used somehow?
Update: there are also Django ManyToManyField fields that get processed. How would these be handled with a direct load?
5 Answers
I'm not sure about your case, but we had a similar scenario with Django where ~30 million records took more than a day to import.
Since our customer was totally unsatisfied (with the danger of losing the project), after several failed optimization attempts in Python we made a radical strategy change and did the import (and only the import) with Java and JDBC, plus some MySQL tuning, and got the import time down to ~45 minutes (with Java it was very easy to optimize thanks to the very good IDE and profiler support).
I would suggest using the MySQL Python driver directly. Also, you might want to consider some multi-threading options.
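A minimal sketch of what that could look like with the MySQLdb driver, batching rows through executemany; the connection settings, table name, columns, and CSV layout are all placeholders, and INSERT IGNORE stands in for the "record already exists" check (it assumes a unique index on the key column):

```python
import csv
import MySQLdb

# Placeholder connection settings and table/column names.
conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="mydb")
cur = conn.cursor()
sql = "INSERT IGNORE INTO myapp_record (code, name) VALUES (%s, %s)"

with open("records.csv") as f:
    batch = []
    for row in csv.reader(f):
        batch.append((row[0], row[1]))
        if len(batch) == 1000:
            cur.executemany(sql, batch)   # one round trip per 1000 rows
            conn.commit()
            batch = []
    if batch:                             # flush the final partial batch
        cur.executemany(sql, batch)
        conn.commit()
conn.close()
```

Compared with one ORM save() per row, this avoids a query-per-record existence check and a commit per insert, which is usually where the days go.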
Depending upon the data format (you said CSV) and the database, you'll probably be better off loading the data directly into the database (either directly into the Django-managed tables, or into temp tables). As an example, Oracle and SQL Server provide custom tools for loading large amounts of data. In the case of MySQL, there are a lot of tricks you can use. As an example, you can write a Perl/Python script to read the CSV file and create a SQL script full of INSERT statements, and then feed that SQL script directly to MySQL.
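A rough sketch of that CSV-to-SQL-script idea; the table and column names are invented, and the quoting is deliberately simplistic (real data would want proper escaping, e.g. via MySQLdb.escape_string):

```python
import csv

# Invented table/columns; assumes a 3-column CSV with no header row.
with open("records.csv") as src, open("load.sql", "w") as dst:
    dst.write("SET autocommit=0;\n")
    for row in csv.reader(src):
        values = ", ".join("'%s'" % v.replace("'", "\\'") for v in row)
        dst.write("INSERT INTO myapp_record (code, name, qty) "
                  "VALUES (%s);\n" % values)
    dst.write("COMMIT;\n")
```

Then feed the result straight to the client: `mysql mydb < load.sql`.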
As others have said, always drop your indexes and triggers before loading large amounts of data, and then add them back afterwards -- rebuilding indexes after every insert is a major processing hit.
If you're using transactions, either turn them off or batch your inserts to keep the transactions from being too large (the definition of too large varies, but if you're doing 1 million rows of data, breaking that into 1 thousand transactions is probably about right).
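Putting those two points together, a hedged sketch with invented names: ALTER TABLE ... DISABLE KEYS defers non-unique index maintenance on MyISAM tables (on InnoDB you would drop and re-create the secondary indexes instead), and commits happen every ~1000 rows rather than per insert:

```python
import csv
import MySQLdb

conn = MySQLdb.connect(db="mydb", user="user", passwd="secret")
cur = conn.cursor()

cur.execute("ALTER TABLE myapp_record DISABLE KEYS")  # defer index rebuilds
try:
    with open("records.csv") as f:
        for i, row in enumerate(csv.reader(f)):
            cur.execute(
                "INSERT INTO myapp_record (code, name) VALUES (%s, %s)",
                (row[0], row[1]))
            if i % 1000 == 999:   # ~1000 rows per transaction, as above
                conn.commit()
    conn.commit()
finally:
    cur.execute("ALTER TABLE myapp_record ENABLE KEYS")  # rebuild indexes once
    conn.commit()
conn.close()
```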
And most importantly, BACK UP YOUR DATABASE FIRST! The only thing worse than having to restore your database from a backup because of an import screwup is not having a current backup to restore from.
As mentioned, you want to bypass the ORM and go directly to the database. Depending on what type of database you're using, you'll probably find good options for loading the CSV data directly. With Oracle you can use external tables for very high-speed data loading, and for MySQL you can use the LOAD DATA INFILE statement. I'm sure there's something similar for Postgres as well (its COPY command, for instance).
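A minimal sketch of LOAD DATA issued through MySQLdb; the file path, table, and column list are assumptions for illustration, and the server must also have local_infile enabled:

```python
import MySQLdb

conn = MySQLdb.connect(db="mydb", user="user", passwd="secret",
                       local_infile=1)   # required for LOAD DATA LOCAL
cur = conn.cursor()
cur.execute("""
    LOAD DATA LOCAL INFILE '/tmp/records.csv'
    INTO TABLE myapp_record
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\\n'
    IGNORE 1 LINES
    (code, name, qty)
""")
conn.commit()
conn.close()
```

Because the parsing and insertion happen entirely inside the server, this is typically the fastest option MySQL offers for bulk CSV loads.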
Loading several million records shouldn't take anywhere near 2-4 days; I routinely load a database with several million rows into MySQL running on a very low-end machine in minutes using mysqldump.
Like Craig said, you'd better fill the database directly first.
That implies creating Django models that simply fit the CSV cells (you can then create better models and scripts to move the data).
Then, for feeding the database: a tool of choice for this is Navicat; you can grab a functional 30-day demo on their site. It allows you to import CSV into MySQL, save the import profile as XML, and so on.
Then I would launch the data control scripts from within Django, and once you're done, migrate your models with South to get what you want, or, as I said earlier, create another set of models within your project and use scripts to convert/copy the data.
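On the ManyToManyField part of the update: Django stores M2M links in an ordinary join table, so once both sides are loaded you can fill that table in bulk instead of calling .add() per row. A sketch with an invented Article model (tags = ManyToManyField(Tag)) and a hypothetical read_link_pairs() helper yielding (article_id, tag_id) tuples from your CSV; bulk_create needs Django 1.4+:

```python
from myapp.models import Article  # invented model for illustration

# The auto-created join table, e.g. "myapp_article_tags",
# with (id, article_id, tag_id) columns:
Through = Article.tags.through

links = [
    Through(article_id=article_id, tag_id=tag_id)
    for article_id, tag_id in read_link_pairs("article_tags.csv")  # hypothetical
]
Through.objects.bulk_create(links)  # a handful of multi-row INSERTs, not N queries
```

The same join table can also be targeted directly by LOAD DATA INFILE or a generated SQL script, exactly like any other table, as long as the foreign-key IDs on both sides already exist.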