在 Linux 中从 Excel 数据透视表电子表格中提取数据
我有一个基于数据透视表的 Excel 电子表格,该数据透视表会定期更新(每月)并上传到我的服务器(由一个对更改输出中的任何内容非常犹豫的小组生成)。我希望能够编写一个可以通过 cron 作业运行的脚本来处理数据透视表中的原始数据并将其加载到我的数据库中。
但是,我无法弄清楚如何获取基础数据,而无需手动进入Windows,在Excel中打开文件,双击总计单元格,获取一个包含填充该单元格的所有原始数据的新工作表,将该工作表保存为 csv,然后我可以通过某种语言(在我的例子中是 python)将其加载到我的数据库中。看起来他们应该是一些可编写脚本的方式来提取底层数据。
我只有 Linux 机器(在虚拟机中运行 Windows/Office;但我更喜欢不涉及 Windows 的解决方案)。我熟悉 xls2csv(不访问原始数据)等工具,并熟悉使用 python-unoconv 等工具从 python 编辑 openoffice 文档。然而,即使手动使用 openoffice,我也看不到获取底层数据的方法。
编辑:在花了好几个小时没有取得任何进展(在发布此内容之前)之后,我不会开始通过 unoconv 将其转换为 ODS 来进行一些操作,并且可能能够使用 python-odf 来提取最后一个表(称为“DPCache”)。
所以现在的问题是如何将 ODS 中的一张表转换为 CSV;这对我来说应该不难理解(尽管非常感谢帮助)。
I have an excel spreadsheet based on a pivot table that is periodically updated (monthly) and uploaded to my server (generated by a group that is very hesitant to changing anything in the output). I would like to be able to write a script that I could run via cron job to process and load the raw data from the pivot table into my database.
However, I can't figure out how to get at the underlying data without manually going into windows, opening the file in excel, double-clicking the totals cell, getting a new sheet with all the raw data that went into populating that cell, and saving that sheet as a csv, that I can then load into my database via some language (in my case python). It seems like their should be some scriptable way to extract the underlying data.
I only have linux machines (running windows/office in a VM; but I'd prefer a solution that doesn't involve windows). I am familiar with tools like xls2csv (which doesn't access the raw data) and using tools like python-unoconv to edit openoffice documents from python. However, even manually using openoffice I don't see a way to get at the underlying data.
EDIT: After spending a good few hours not making any progress (prior to posting this), I'm not starting to make some by converting it to ODS via unoconv and likely will be able to use something with python-odf to extract the last sheet (Called 'DPCache').
So now the problem is to get a sheet from an ODS converted into a CSV; which shouldn't be too hard for me to figure out (though help is greatly appreciated).
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
我曾经有同样的问题。您可以通过解压缩 xlsx 并读取/解释 xml 文件来解决。比较重要的两个文件是这些。
第一个,具有pivotCacheRecords1.xml中原始数据的关系,您需要通过索引号访问它,我的意思是通过每一列在具有标签
的pivotCacheRecords1.xml中,您需要通过标签
的索引号获取pivotCacheDefinition1.xml中的数据,以便更好地了解后,您需要查看 xml 文件。pivotCacheDefinition1.xml
pivotCacheRecords1.xml
看到 CacheRecords1 标签中的
是 CacheDefinition1 中标签的关系,现在如果你理解了这一点,那么制作一个dict 在记录的迭代中使用它。我们最终得到了这个命令。
然后你所要做的就是迭代 CacheRecords1 并在标签为
时将列的索引与 defdict 中的键进行匹配然后你可以将 dfdata 放入数据框中
剩下的就是历史,我希望这对你有帮助。
快乐编码!!!
i used to have the same issue. You can resolved by unzip the xlsx and reading/interpret the xml files. The two files that are more important are these.
The first one, have the relationshit of the raw data in pivotCacheRecords1.xml, that you need to access by index number, what i mean by this, is that by every column in pivotCacheRecords1.xml that have the tag
<x>
you need to obtain the data in pivotCacheDefinition1.xml by the index number of the tag<x>
, for better understanding, you need to see the xml files.pivotCacheDefinition1.xml
pivotCacheRecords1.xml
See that the
<x>
in CacheRecords1 tag is a relation for the<s>
tag in CacheDefinition1, now if you understand this, is not so dificult to make a dict to use it in the iterations of the records.We endup whith this dict.
Then all you have todo is iterate over CacheRecords1 and match the index of the column with the key in defdict when the tag is
<x>
Then you can put dfdata in a dataframe
The rest is history, i wish this would help you.
Happy Coding!!!
您尝试过 xlrd 吗?另请参阅 python-excel 网站 中提供的教程。
就这么简单:
Have you tried xlrd? See also the tutorial available from the python-excel website.
It's this simple: