We are using OpenPyxl to export MySQL content to Microsoft Excel in XLSX format:
https://bitbucket.org/ericgazoni/openpyxl/overview
However, the amount of data we are dealing with is big, and we are running into out-of-memory situations. Tables may contain up to 400 columns across 50,000+ rows. Even though the files are big, they are not so big that Microsoft Excel or OpenOffice should have problems with them.
We are assuming our issues mainly stem from the way Python keeps the XML DOM structure in memory, which is not efficient enough.
EDIT: Eric, the author of OpenPyxl, pointed out that there is an option to make OpenPyxl write with fixed memory usage. However, this didn't solve our problem completely, as we still have issues with raw speed and something else taking up too much memory in Python.
Now we are looking for more efficient ways to create Excel files. Preferably with Python, but if we cannot find a good solution we might want to look at other programming languages as well.
Options, in no particular order, include:
1) Using OpenOffice and PyUno, hoping that their memory structures are more efficient than OpenPyxl's and that the TCP/IP call bridge is efficient enough
2) Openpyxl uses xml.etree. Would Python's lxml (a native libxml2 extension) be more efficient with XML memory structures, and could xml.etree be replaced directly with an lxml drop-in, e.g. by monkey-patching? (The changes could later be contributed back to Openpyxl if there is a clear benefit.)
3) Export from MySQL to CSV and then post-process the CSV files directly to XLSX using Python and file iteration (a minimal sketch follows this list)
4) Use other programming languages and libraries (Java)
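To make option 3 concrete, here is a minimal sketch of the CSV-to-XLSX pass, assuming the table has already been dumped to a hypothetical export.csv and using openpyxl's optimized writer so that only one row is held in memory at a time (optimized_write is the keyword in the releases current at the time of writing; later versions expose the same mode as write_only):

    import csv
    from openpyxl import Workbook

    # Stream the MySQL CSV dump into a worksheet row by row; appended
    # rows are serialized immediately instead of being kept in a DOM.
    wb = Workbook(optimized_write=True)  # write_only=True in later releases
    ws = wb.create_sheet()  # optimized workbooks only support create_sheet()

    # 'export.csv' is a hypothetical dump, e.g. from SELECT ... INTO OUTFILE.
    with open('export.csv') as f:
        for row in csv.reader(f):
            ws.append(row)

    wb.save('export.xlsx')  # the package is assembled on save

The same loop would also work with a database cursor in place of csv.reader, which would skip the CSV step entirely.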
Pointers:
http://dev.lethain.com/handling-very-large-csv-and-xml-files-in-python/
http://enginoz.wordpress.com/2010/03/31/writing-xlsx-with-java/
2 Answers
If you're going to use Java, you will want to use Apache POI, but likely not the regular UserModel, since you want to keep your memory footprint down.
Instead, take a look at BigGridDemo, which shows you how to write a very large xlsx file using POI, with most of the work not happening in memory.
You might also find that the technique used in the BigGridDemo could equally be used in Python?
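The technique does translate to Python fairly directly. Below is a minimal sketch under these assumptions: a small template.xlsx was saved beforehand with a single empty sheet (so its grid lives at xl/worksheets/sheet1.xml inside the package), and all file names are hypothetical. The big sheet XML is generated as a stream of inline-string rows, spooled to a temporary file, and then swapped into a copy of the template:

    import os
    import tempfile
    import zipfile
    from xml.sax.saxutils import escape

    TEMPLATE = 'template.xlsx'    # pre-saved workbook with one empty sheet
    TARGET = 'big_output.xlsx'
    SHEET_PART = 'xl/worksheets/sheet1.xml'

    def sheet_xml(rows):
        """Yield worksheet XML chunk by chunk, using inline-string cells."""
        yield ('<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
               '<worksheet xmlns="http://schemas.openxmlformats.org/'
               'spreadsheetml/2006/main"><sheetData>')
        for r, row in enumerate(rows, 1):
            cells = ''.join('<c t="inlineStr"><is><t>%s</t></is></c>'
                            % escape(str(v)) for v in row)
            yield '<row r="%d">%s</row>' % (r, cells)
        yield '</sheetData></worksheet>'

    def write_big_xlsx(rows):
        # Spool the generated XML to disk so it never sits in memory whole.
        fd, tmp = tempfile.mkstemp(suffix='.xml')
        with os.fdopen(fd, 'w') as out:
            for chunk in sheet_xml(rows):
                out.write(chunk)
        # Copy every part of the template except the placeholder sheet,
        # then inject the generated sheet in its place.
        with zipfile.ZipFile(TEMPLATE) as src, \
             zipfile.ZipFile(TARGET, 'w', zipfile.ZIP_DEFLATED) as dst:
            for item in src.infolist():
                if item.filename != SHEET_PART:
                    dst.writestr(item, src.read(item.filename))
            dst.write(tmp, SHEET_PART)
        os.remove(tmp)

Since sheet_xml only needs an iterable of rows, a database cursor could be passed straight to write_big_xlsx without materializing the result set.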
Have you tried to look at the optimized writer for openpyxl? It's a recent feature (2 months old), but it's quite robust (used in production in several corporate projects) and can handle an almost unlimited amount of data with steady memory consumption (around 7 MB)
http://packages.python.org/openpyxl/optimized.html#optimized-writer
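For reference, usage is roughly as follows; a minimal sketch assuming the optimized_write keyword documented at that link (later openpyxl releases expose the same mode as write_only=True), sized to the 400-column, 50,000-row table from the question:

    from openpyxl import Workbook

    wb = Workbook(optimized_write=True)  # write_only=True in later releases
    ws = wb.create_sheet()  # optimized workbooks only support create_sheet()

    # Appended rows are flushed as they are written rather than kept in
    # a DOM, which is what keeps memory consumption steady.
    for i in range(50000):
        ws.append(['row %d, col %d' % (i, c) for c in range(400)])

    wb.save('big_export.xlsx')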