在 Linux 中从 Excel 数据透视表电子表格中提取数据

发布于 2024-10-07 22:23:09 字数 655 浏览 1 评论 0原文

我有一个基于数据透视表的 Excel 电子表格，该数据透视表会定期更新（每月）并上传到我的服务器（由一个对更改输出中的任何内容非常犹豫的小组生成）。我希望能够编写一个可以通过 cron 作业运行的脚本来处理数据透视表中的原始数据并将其加载到我的数据库中。

但是，我无法弄清楚如何获取基础数据，而无需手动进入Windows，在Excel中打开文件，双击总计单元格，获取一个包含填充该单元格的所有原始数据的新工作表，将该工作表保存为 csv，然后我可以通过某种语言（在我的例子中是 python）将其加载到我的数据库中。看起来他们应该是一些可编写脚本的方式来提取底层数据。

我只有 Linux 机器（在虚拟机中运行 Windows/Office；但我更喜欢不涉及 Windows 的解决方案）。我熟悉 xls2csv（不访问原始数据）等工具，并熟悉使用 python-unoconv 等工具从 python 编辑 openoffice 文档。然而，即使手动使用 openoffice，我也看不到获取底层数据的方法。

编辑：在花了好几个小时没有取得任何进展（在发布此内容之前）之后，我不会开始通过 unoconv 将其转换为 ODS 来进行一些操作，并且可能能够使用 python-odf 来提取最后一个表（称为“DPCache”）。

所以现在的问题是如何将 ODS 中的一张表转换为 CSV；这对我来说应该不难理解（尽管非常感谢帮助）。

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

李白 2024-10-14 22:23:09

我曾经有同样的问题。您可以通过解压缩 xlsx 并读取/解释 xml 文件来解决。比较重要的两个文件是这些。

xl/pivotCache/pivotCacheDefinition1.xml
xl/pivotCache/pivotCacheRecords1.xml

第一个，具有pivotCacheRecords1.xml中原始数据的关系，您需要通过索引号访问它，我的意思是通过每一列在具有标签的pivotCacheRecords1.xml中，您需要通过标签的索引号获取pivotCacheDefinition1.xml中的数据，以便更好地了解后，您需要查看 xml 文件。

pivotCacheDefinition1.xml

    <?xml version="1.0" encoding="UTF-8"?>
<pivotCacheDefinition xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" r:id="rId1" refreshedBy="ADNLatam" refreshedDate="42972.64919178241" createdVersion="5" refreshedVersion="6" recordCount="1923161">
   <cacheSource type="external" connectionId="1" />
   <cacheFields count="26">
      <cacheField name="C - Cadenas" numFmtId="0" sqlType="-9">
         <sharedItems count="3">
            <s v="superA" />
            <s v="superB" />
            <s v="superC" u="1" />
         </sharedItems>
      </cacheField>
      <cacheField name="C - Locales" numFmtId="0" sqlType="-9"><span data-mce-type="bookmark" style="display: inline-block; width: 0px; overflow: hidden; line-height: 0;" class="mce_SELRES_start"></span>
         <sharedItems count="80">
            <s v="Itaugua" />
            <s v="Denis Roa" />
            <s v="Total" />
            <s v="Los Laureles" />
            <s v="CDE" />
            <s v="S6 Fdo." />
            <s v="Central" u="1" />
            <s v="Unicompra" u="1" />
            <s v="San Lorenzo Centro" u="1" />
         </sharedItems>
      </cacheField>
   </cacheFields>
</pivotCacheDefinition>
</xml>

pivotCacheRecords1.xml

<pivotCacheRecords
xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" count="246209">
<r>
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <s v="PAÐAL "PAMPERS" BABYSAN REGULAR GDE 9UN"/> #Z - Sku / Descripcion
    <s v="07501006720341"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <n v="1"/> 
    <n v="11990"/> 
    <n v="2.3199999999999998"/> 
    <n v="10900"/> 
    <n v="11990"/> 
    <n v="1"/> 
    <d v="2012-02-03T00:00:00"/> 
    <x v="0"/> 
    <x v="0"/> 
    <n v="3"/> 
    <n v="6"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
</r>

看到 CacheRecords1 标签中的是 CacheDefinition1 中 ~~标签的关系，现在如果你理解了这一点，那么制作一个dict 在记录的迭代中使用它。~~

      definitions = '/tmp/scantrack_tmp/xl/pivotCache/pivotCacheDefinition1.xml'
      defdict = {}
      columnas = []
      e = xml.etree.ElementTree.parse(definitions).getroot()
      for fields in e.findall('{http://schemas.openxmlformats.org/spreadsheetml/2006/main}cacheFields'):
          for cidx, field in enumerate(fields.getchildren()):
              columna = field.attrib.get('name')
              defdict[cidx] = []
              columnas.append(columna)
              for value in field.getchildren()[0].getchildren():
                  tagname = value.tag
                  defdict[cidx].append(value.attrib.get('v', 0))

我们最终得到了这个命令。

{
  0: ['supera', 'superb', u'superc'],
  1: ['Terminal',
     'CDE',
     'Brasilia',
     ]
  3: ['PANTENE', 'DOVE']
  ...
}

然后你所要做的就是迭代 CacheRecords1 并在标签为时将列的索引与 defdict 中的键进行匹配

  dfdata = []


  bdata = '/tmp/scantrack_tmp/xl/pivotCache/pivotCacheRecords1.xml'

  for event, elem in xml.etree.ElementTree.iterparse(bdata, events=('start', 'end')):
    if elem.tag == '{http://schemas.openxmlformats.org/spreadsheetml/2006/main}r' and event == 'start':
       tmpdata = []
       for cidx, valueobj in enumerate(elem.getchildren()):
           tagname = valueobj.tag
           vattrib = valueobj.attrib.get('v')
           rdata = vattrib
           if tagname == '{http://schemas.openxmlformats.org/spreadsheetml/2006/main}x':
                try:
                  rdata = defdict[cidx][int(vattrib)]
                except:
                  logging.error('this it not should happen index cidx = {} vattrib = {} defaultidcts = {} tmpdata for the time = {} xml raw {}'.format(
                                                                                                                                                cidx, vattrib, defdict, tmpdata,
                                                                                                                                                xml.etree.ElementTree.tostring(elem, encoding='utf8', method='xml')
                                                                                                                                                ))
           tmpdata.append(rdata)
       if tmpdata:
           dfdata.append(tmpdata)
       elem.clear()

然后你可以将 dfdata 放入数据框中

df = pd.DataFrame(dfdata).

剩下的就是历史，我希望这对你有帮助。

快乐编码！！！

i used to have the same issue. You can resolved by unzip the xlsx and reading/interpret the xml files. The two files that are more important are these.

xl/pivotCache/pivotCacheDefinition1.xml
xl/pivotCache/pivotCacheRecords1.xml

The first one, have the relationshit of the raw data in pivotCacheRecords1.xml, that you need to access by index number, what i mean by this, is that by every column in pivotCacheRecords1.xml that have the tag <x> you need to obtain the data in pivotCacheDefinition1.xml by the index number of the tag <x>, for better understanding, you need to see the xml files.

pivotCacheDefinition1.xml

    <?xml version="1.0" encoding="UTF-8"?>
<pivotCacheDefinition xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" r:id="rId1" refreshedBy="ADNLatam" refreshedDate="42972.64919178241" createdVersion="5" refreshedVersion="6" recordCount="1923161">
   <cacheSource type="external" connectionId="1" />
   <cacheFields count="26">
      <cacheField name="C - Cadenas" numFmtId="0" sqlType="-9">
         <sharedItems count="3">
            <s v="superA" />
            <s v="superB" />
            <s v="superC" u="1" />
         </sharedItems>
      </cacheField>
      <cacheField name="C - Locales" numFmtId="0" sqlType="-9"><span data-mce-type="bookmark" style="display: inline-block; width: 0px; overflow: hidden; line-height: 0;" class="mce_SELRES_start"></span>
         <sharedItems count="80">
            <s v="Itaugua" />
            <s v="Denis Roa" />
            <s v="Total" />
            <s v="Los Laureles" />
            <s v="CDE" />
            <s v="S6 Fdo." />
            <s v="Central" u="1" />
            <s v="Unicompra" u="1" />
            <s v="San Lorenzo Centro" u="1" />
         </sharedItems>
      </cacheField>
   </cacheFields>
</pivotCacheDefinition>
</xml>

pivotCacheRecords1.xml

<pivotCacheRecords
xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" count="246209">
<r>
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <s v="PAÐAL "PAMPERS" BABYSAN REGULAR GDE 9UN"/> #Z - Sku / Descripcion
    <s v="07501006720341"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <n v="1"/> 
    <n v="11990"/> 
    <n v="2.3199999999999998"/> 
    <n v="10900"/> 
    <n v="11990"/> 
    <n v="1"/> 
    <d v="2012-02-03T00:00:00"/> 
    <x v="0"/> 
    <x v="0"/> 
    <n v="3"/> 
    <n v="6"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
</r>

See that the <x> in CacheRecords1 tag is a relation for the <s> tag in CacheDefinition1, now if you understand this, is not so dificult to make a dict to use it in the iterations of the records.

      definitions = '/tmp/scantrack_tmp/xl/pivotCache/pivotCacheDefinition1.xml'
      defdict = {}
      columnas = []
      e = xml.etree.ElementTree.parse(definitions).getroot()
      for fields in e.findall('{http://schemas.openxmlformats.org/spreadsheetml/2006/main}cacheFields'):
          for cidx, field in enumerate(fields.getchildren()):
              columna = field.attrib.get('name')
              defdict[cidx] = []
              columnas.append(columna)
              for value in field.getchildren()[0].getchildren():
                  tagname = value.tag
                  defdict[cidx].append(value.attrib.get('v', 0))

We endup whith this dict.

{
  0: ['supera', 'superb', u'superc'],
  1: ['Terminal',
     'CDE',
     'Brasilia',
     ]
  3: ['PANTENE', 'DOVE']
  ...
}

Then all you have todo is iterate over CacheRecords1 and match the index of the column with the key in defdict when the tag is <x>

  dfdata = []


  bdata = '/tmp/scantrack_tmp/xl/pivotCache/pivotCacheRecords1.xml'

  for event, elem in xml.etree.ElementTree.iterparse(bdata, events=('start', 'end')):
    if elem.tag == '{http://schemas.openxmlformats.org/spreadsheetml/2006/main}r' and event == 'start':
       tmpdata = []
       for cidx, valueobj in enumerate(elem.getchildren()):
           tagname = valueobj.tag
           vattrib = valueobj.attrib.get('v')
           rdata = vattrib
           if tagname == '{http://schemas.openxmlformats.org/spreadsheetml/2006/main}x':
                try:
                  rdata = defdict[cidx][int(vattrib)]
                except:
                  logging.error('this it not should happen index cidx = {} vattrib = {} defaultidcts = {} tmpdata for the time = {} xml raw {}'.format(
                                                                                                                                                cidx, vattrib, defdict, tmpdata,
                                                                                                                                                xml.etree.ElementTree.tostring(elem, encoding='utf8', method='xml')
                                                                                                                                                ))
           tmpdata.append(rdata)
       if tmpdata:
           dfdata.append(tmpdata)
       elem.clear()

Then you can put dfdata in a dataframe

df = pd.DataFrame(dfdata).

The rest is history, i wish this would help you.

Happy Coding!!!

回复收藏 0 原文

半﹌身腐败 2024-10-14 22:23:09

您尝试过 xlrd 吗？另请参阅 python-excel 网站中提供的教程。

就这么简单：

>>> import xlrd
>>> book = xlrd.open_workbook('pivot_table_demo.xls')
>>> sheet = book.sheet_by_name('Summary')
>>> for row_index in xrange(sheet.nrows):
...     print sheet.row_values(row_index)
...
[u'Sum of sales', u'qtr', '', '', '', '']
[u'person', 1.0, 2.0, 3.0, 4.0, u'Grand Total']
[u'dick', 100.0, 99.0, 95.0, 90.0, 384.0]
[u'harriet', 100.0, 110.0, 121.0, 133.1, 464.1]
[u'tom', 100.0, 101.0, 102.0, 103.0, 406.0]
[u'Grand Total', 300.0, 310.0, 318.0, 326.1, 1254.1]
>>>

Have you tried xlrd? See also the tutorial available from the python-excel website.

It's this simple:

>>> import xlrd
>>> book = xlrd.open_workbook('pivot_table_demo.xls')
>>> sheet = book.sheet_by_name('Summary')
>>> for row_index in xrange(sheet.nrows):
...     print sheet.row_values(row_index)
...
[u'Sum of sales', u'qtr', '', '', '', '']
[u'person', 1.0, 2.0, 3.0, 4.0, u'Grand Total']
[u'dick', 100.0, 99.0, 95.0, 90.0, 384.0]
[u'harriet', 100.0, 110.0, 121.0, 133.1, 464.1]
[u'tom', 100.0, 101.0, 102.0, 103.0, 406.0]
[u'Grand Total', 300.0, 310.0, 318.0, 326.1, 1254.1]
>>>

回复收藏 0 原文

~没有更多了~