在 Linux 中从 Excel 数据透视表电子表格中提取数据

发布于 2024-10-07 22:23:09 字数 655 浏览 1 评论 0原文

我有一个基于数据透视表的 Excel 电子表格,该数据透视表会定期更新(每月)并上传到我的服务器(由一个对更改输出中的任何内容非常犹豫的小组生成)。我希望能够编写一个可以通过 cron 作业运行的脚本来处理数据透视表中的原始数据并将其加载到我的数据库中。

但是,我无法弄清楚如何获取基础数据,而无需手动进入Windows,在Excel中打开文件,双击总计单元格,获取一个包含填充该单元格的所有原始数据的新工作表,将该工作表保存为 csv,然后我可以通过某种语言(在我的例子中是 python)将其加载到我的数据库中。看起来他们应该是一些可编写脚本的方式来提取底层数据。

我只有 Linux 机器(在虚拟机中运行 Windows/Office;但我更喜欢不涉及 Windows 的解决方案)。我熟悉 xls2csv(不访问原始数据)等工具,并熟悉使用 python-unoconv 等工具从 python 编辑 openoffice 文档。然而,即使手动使用 openoffice,我也看不到获取底层数据的方法。

编辑:在花了好几个小时没有取得任何进展(在发布此内容之前)之后,我不会开始通过 unoconv 将其转换为 ODS 来进行一些操作,并且可能能够使用 python-odf 来提取最后一个表(称为“DPCache”)。

所以现在的问题是如何将 ODS 中的一张表转换为 CSV;这对我来说应该不难理解(尽管非常感谢帮助)。

I have an excel spreadsheet based on a pivot table that is periodically updated (monthly) and uploaded to my server (generated by a group that is very hesitant to changing anything in the output). I would like to be able to write a script that I could run via cron job to process and load the raw data from the pivot table into my database.

However, I can't figure out how to get at the underlying data without manually going into windows, opening the file in excel, double-clicking the totals cell, getting a new sheet with all the raw data that went into populating that cell, and saving that sheet as a csv, that I can then load into my database via some language (in my case python). It seems like their should be some scriptable way to extract the underlying data.

I only have linux machines (running windows/office in a VM; but I'd prefer a solution that doesn't involve windows). I am familiar with tools like xls2csv (which doesn't access the raw data) and using tools like python-unoconv to edit openoffice documents from python. However, even manually using openoffice I don't see a way to get at the underlying data.

EDIT: After spending a good few hours not making any progress (prior to posting this), I'm not starting to make some by converting it to ODS via unoconv and likely will be able to use something with python-odf to extract the last sheet (Called 'DPCache').

So now the problem is to get a sheet from an ODS converted into a CSV; which shouldn't be too hard for me to figure out (though help is greatly appreciated).

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

李白 2024-10-14 22:23:09

我曾经有同样的问题。您可以通过解压缩 xlsx 并读取/解释 xml 文件来解决。比较重要的两个文件是这些。

  • xl/pivotCache/pivotCacheDefinition1.xml
  • xl/pivotCache/pivotCacheRecords1.xml

第一个,具有pivotCacheRecords1.xml中原始数据的关系,您需要通过索引号访问它,我的意思是通过每一列在具有标签的pivotCacheRecords1.xml中,您需要通过标签的索引号获取pivotCacheDefinition1.xml中的数据,以便更好地了解后,您需要查看 xml 文件。

pivotCacheDefinition1.xml

    <?xml version="1.0" encoding="UTF-8"?>
<pivotCacheDefinition xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" r:id="rId1" refreshedBy="ADNLatam" refreshedDate="42972.64919178241" createdVersion="5" refreshedVersion="6" recordCount="1923161">
   <cacheSource type="external" connectionId="1" />
   <cacheFields count="26">
      <cacheField name="C - Cadenas" numFmtId="0" sqlType="-9">
         <sharedItems count="3">
            <s v="superA" />
            <s v="superB" />
            <s v="superC" u="1" />
         </sharedItems>
      </cacheField>
      <cacheField name="C - Locales" numFmtId="0" sqlType="-9"><span data-mce-type="bookmark" style="display: inline-block; width: 0px; overflow: hidden; line-height: 0;" class="mce_SELRES_start"></span>
         <sharedItems count="80">
            <s v="Itaugua" />
            <s v="Denis Roa" />
            <s v="Total" />
            <s v="Los Laureles" />
            <s v="CDE" />
            <s v="S6 Fdo." />
            <s v="Central" u="1" />
            <s v="Unicompra" u="1" />
            <s v="San Lorenzo Centro" u="1" />
         </sharedItems>
      </cacheField>
   </cacheFields>
</pivotCacheDefinition>
</xml>

pivotCacheRecords1.xml

<pivotCacheRecords
xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" count="246209">
<r>
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <s v="PAÐAL "PAMPERS" BABYSAN REGULAR GDE 9UN"/> #Z - Sku / Descripcion
    <s v="07501006720341"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <n v="1"/> 
    <n v="11990"/> 
    <n v="2.3199999999999998"/> 
    <n v="10900"/> 
    <n v="11990"/> 
    <n v="1"/> 
    <d v="2012-02-03T00:00:00"/> 
    <x v="0"/> 
    <x v="0"/> 
    <n v="3"/> 
    <n v="6"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
</r>

看到 CacheRecords1 标签中的 是 CacheDefinition1 中 标签的关系,现在如果你理解了这一点,那么制作一个dict 在记录的迭代中使用它。

      definitions = '/tmp/scantrack_tmp/xl/pivotCache/pivotCacheDefinition1.xml'
      defdict = {}
      columnas = []
      e = xml.etree.ElementTree.parse(definitions).getroot()
      for fields in e.findall('{http://schemas.openxmlformats.org/spreadsheetml/2006/main}cacheFields'):
          for cidx, field in enumerate(fields.getchildren()):
              columna = field.attrib.get('name')
              defdict[cidx] = []
              columnas.append(columna)
              for value in field.getchildren()[0].getchildren():
                  tagname = value.tag
                  defdict[cidx].append(value.attrib.get('v', 0))

我们最终得到了这个命令。

{
  0: ['supera', 'superb', u'superc'],
  1: ['Terminal',
     'CDE',
     'Brasilia',
     ]
  3: ['PANTENE', 'DOVE']
  ...
}

然后你所要做的就是迭代 CacheRecords1 并在标签为 时将列的索引与 defdict 中的键进行匹配

  dfdata = []


  bdata = '/tmp/scantrack_tmp/xl/pivotCache/pivotCacheRecords1.xml'

  for event, elem in xml.etree.ElementTree.iterparse(bdata, events=('start', 'end')):
    if elem.tag == '{http://schemas.openxmlformats.org/spreadsheetml/2006/main}r' and event == 'start':
       tmpdata = []
       for cidx, valueobj in enumerate(elem.getchildren()):
           tagname = valueobj.tag
           vattrib = valueobj.attrib.get('v')
           rdata = vattrib
           if tagname == '{http://schemas.openxmlformats.org/spreadsheetml/2006/main}x':
                try:
                  rdata = defdict[cidx][int(vattrib)]
                except:
                  logging.error('this it not should happen index cidx = {} vattrib = {} defaultidcts = {} tmpdata for the time = {} xml raw {}'.format(
                                                                                                                                                cidx, vattrib, defdict, tmpdata,
                                                                                                                                                xml.etree.ElementTree.tostring(elem, encoding='utf8', method='xml')
                                                                                                                                                ))
           tmpdata.append(rdata)
       if tmpdata:
           dfdata.append(tmpdata)
       elem.clear()

然后你可以将 dfdata 放入数据框中

df = pd.DataFrame(dfdata).

剩下的就是历史,我希望这对你有帮助。

快乐编码!!!

i used to have the same issue. You can resolved by unzip the xlsx and reading/interpret the xml files. The two files that are more important are these.

  • xl/pivotCache/pivotCacheDefinition1.xml
  • xl/pivotCache/pivotCacheRecords1.xml

The first one, have the relationshit of the raw data in pivotCacheRecords1.xml, that you need to access by index number, what i mean by this, is that by every column in pivotCacheRecords1.xml that have the tag <x> you need to obtain the data in pivotCacheDefinition1.xml by the index number of the tag <x>, for better understanding, you need to see the xml files.

pivotCacheDefinition1.xml

    <?xml version="1.0" encoding="UTF-8"?>
<pivotCacheDefinition xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" r:id="rId1" refreshedBy="ADNLatam" refreshedDate="42972.64919178241" createdVersion="5" refreshedVersion="6" recordCount="1923161">
   <cacheSource type="external" connectionId="1" />
   <cacheFields count="26">
      <cacheField name="C - Cadenas" numFmtId="0" sqlType="-9">
         <sharedItems count="3">
            <s v="superA" />
            <s v="superB" />
            <s v="superC" u="1" />
         </sharedItems>
      </cacheField>
      <cacheField name="C - Locales" numFmtId="0" sqlType="-9"><span data-mce-type="bookmark" style="display: inline-block; width: 0px; overflow: hidden; line-height: 0;" class="mce_SELRES_start"></span>
         <sharedItems count="80">
            <s v="Itaugua" />
            <s v="Denis Roa" />
            <s v="Total" />
            <s v="Los Laureles" />
            <s v="CDE" />
            <s v="S6 Fdo." />
            <s v="Central" u="1" />
            <s v="Unicompra" u="1" />
            <s v="San Lorenzo Centro" u="1" />
         </sharedItems>
      </cacheField>
   </cacheFields>
</pivotCacheDefinition>
</xml>

pivotCacheRecords1.xml

<pivotCacheRecords
xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" count="246209">
<r>
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <s v="PAÐAL "PAMPERS" BABYSAN REGULAR GDE 9UN"/> #Z - Sku / Descripcion
    <s v="07501006720341"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <n v="1"/> 
    <n v="11990"/> 
    <n v="2.3199999999999998"/> 
    <n v="10900"/> 
    <n v="11990"/> 
    <n v="1"/> 
    <d v="2012-02-03T00:00:00"/> 
    <x v="0"/> 
    <x v="0"/> 
    <n v="3"/> 
    <n v="6"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
</r>

See that the <x> in CacheRecords1 tag is a relation for the <s> tag in CacheDefinition1, now if you understand this, is not so dificult to make a dict to use it in the iterations of the records.

      definitions = '/tmp/scantrack_tmp/xl/pivotCache/pivotCacheDefinition1.xml'
      defdict = {}
      columnas = []
      e = xml.etree.ElementTree.parse(definitions).getroot()
      for fields in e.findall('{http://schemas.openxmlformats.org/spreadsheetml/2006/main}cacheFields'):
          for cidx, field in enumerate(fields.getchildren()):
              columna = field.attrib.get('name')
              defdict[cidx] = []
              columnas.append(columna)
              for value in field.getchildren()[0].getchildren():
                  tagname = value.tag
                  defdict[cidx].append(value.attrib.get('v', 0))

We endup whith this dict.

{
  0: ['supera', 'superb', u'superc'],
  1: ['Terminal',
     'CDE',
     'Brasilia',
     ]
  3: ['PANTENE', 'DOVE']
  ...
}

Then all you have todo is iterate over CacheRecords1 and match the index of the column with the key in defdict when the tag is <x>

  dfdata = []


  bdata = '/tmp/scantrack_tmp/xl/pivotCache/pivotCacheRecords1.xml'

  for event, elem in xml.etree.ElementTree.iterparse(bdata, events=('start', 'end')):
    if elem.tag == '{http://schemas.openxmlformats.org/spreadsheetml/2006/main}r' and event == 'start':
       tmpdata = []
       for cidx, valueobj in enumerate(elem.getchildren()):
           tagname = valueobj.tag
           vattrib = valueobj.attrib.get('v')
           rdata = vattrib
           if tagname == '{http://schemas.openxmlformats.org/spreadsheetml/2006/main}x':
                try:
                  rdata = defdict[cidx][int(vattrib)]
                except:
                  logging.error('this it not should happen index cidx = {} vattrib = {} defaultidcts = {} tmpdata for the time = {} xml raw {}'.format(
                                                                                                                                                cidx, vattrib, defdict, tmpdata,
                                                                                                                                                xml.etree.ElementTree.tostring(elem, encoding='utf8', method='xml')
                                                                                                                                                ))
           tmpdata.append(rdata)
       if tmpdata:
           dfdata.append(tmpdata)
       elem.clear()

Then you can put dfdata in a dataframe

df = pd.DataFrame(dfdata).

The rest is history, i wish this would help you.

Happy Coding!!!

半﹌身腐败 2024-10-14 22:23:09

您尝试过 xlrd 吗?另请参阅 python-excel 网站 中提供的教程。

就这么简单:

>>> import xlrd
>>> book = xlrd.open_workbook('pivot_table_demo.xls')
>>> sheet = book.sheet_by_name('Summary')
>>> for row_index in xrange(sheet.nrows):
...     print sheet.row_values(row_index)
...
[u'Sum of sales', u'qtr', '', '', '', '']
[u'person', 1.0, 2.0, 3.0, 4.0, u'Grand Total']
[u'dick', 100.0, 99.0, 95.0, 90.0, 384.0]
[u'harriet', 100.0, 110.0, 121.0, 133.1, 464.1]
[u'tom', 100.0, 101.0, 102.0, 103.0, 406.0]
[u'Grand Total', 300.0, 310.0, 318.0, 326.1, 1254.1]
>>>

Have you tried xlrd? See also the tutorial available from the python-excel website.

It's this simple:

>>> import xlrd
>>> book = xlrd.open_workbook('pivot_table_demo.xls')
>>> sheet = book.sheet_by_name('Summary')
>>> for row_index in xrange(sheet.nrows):
...     print sheet.row_values(row_index)
...
[u'Sum of sales', u'qtr', '', '', '', '']
[u'person', 1.0, 2.0, 3.0, 4.0, u'Grand Total']
[u'dick', 100.0, 99.0, 95.0, 90.0, 384.0]
[u'harriet', 100.0, 110.0, 121.0, 133.1, 464.1]
[u'tom', 100.0, 101.0, 102.0, 103.0, 406.0]
[u'Grand Total', 300.0, 310.0, 318.0, 326.1, 1254.1]
>>>
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文