Getting the formula from an Excel cell with python xlrd

Posted on 2024-10-12 04:20:19

I have to port an algorithm from an Excel sheet to Python code, but I have to reverse engineer the algorithm from the Excel file.

The Excel sheet is quite complicated: it contains many cells with formulas that refer to other cells (which can themselves contain a formula or a constant).

My idea is to analyze the sheet with a Python script and build a table of dependencies between cells, that is:

A1 depends on B4,C5,E7 formula: "=sqrt(B4)+C5*E7"
A2 depends on B5,C6 formula: "=sin(B5)*C6"
...

The xlrd Python module can read an XLS workbook, but at the moment I can only access the value of a cell, not its formula.

For example, with the following code I can simply get the value of a cell:

import xlrd

# open the .xls file
xlsname = "test.xls"
book = xlrd.open_workbook(xlsname)

# build a dictionary of the names->sheets of the book
sd = {}
for s in book.sheets():
    sd[s.name] = s

# obtain sheet "Foglio 1" from the sheet names dictionary
sheet = sd["Foglio 1"]

# print the cell J141 (row and column indexes are zero-based)
print(sheet.cell(140, 9))

However, there seems to be no way to get the formula from the Cell object returned by the .cell(...) method.
The documentation says that it is possible to get a string version of the formula (in English, because no information about function name translation is stored in the Excel file). It mentions formulas (expressions) in the Name and Operand classes, but I cannot work out how to get instances of these classes from the Cell instance that must contain them.

Could you suggest a code snippet that gets the formula text from a cell?

Comments (6)

多孤肩上扛 2024-10-19 04:20:19

[Dis]claimer: I'm the author/maintainer of xlrd.

The documentation references to formula text are about "name" formulas; read the section "Named references, constants, formulas, and macros" near the start of the docs. These formulas are associated sheet-wide or book-wide to a name; they are not associated with individual cells. Examples: PI maps to =22/7, SALES maps to =Mktng!$A$2:$Z$99. The name-formula decompiler was written to support inspection of the simpler and/or commonly found usages of defined names.

Formulas in general are of several kinds: cell, shared, and array (all associated with a cell, directly or indirectly), name, data validation, and conditional formatting.

Decompiling general formulas from bytecode to text is a slow "work-in-progress". Note that even if it were available, you would then need to parse the text formula to extract the cell references. Parsing Excel formulas correctly is not an easy job; as with HTML, using regexes looks easy but doesn't work. It would be better to extract the references directly from the formula bytecode.

Also note that cell-based formulas can refer to names, and name formulas can refer both to cells and to other names. So it would be necessary to extract both cell and name references from both cell-based and name formulas. It may be useful to you to have info on shared formulas available; otherwise having parsed the following:

B2 =A2
B3 =A3+B2
B4 =A4+B3
B5 =A5+B4
...
B60 =A60+B59

you would need to deduce the similarity between the B3:B60 formulas yourself.

In any case, none of the above is likely to be available any time soon -- xlrd priorities lie elsewhere.
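
To make the point about shared formulas concrete, here is a minimal sketch that rewrites each reference as an offset from the cell holding the formula, so that the B3:B60 formulas collapse to the same text. It is not part of xlrd: `CELL_REF`, `col_to_index`, and `to_relative` are hypothetical helper names, and the regex deliberately handles only plain A1-style references (no sheet prefixes, string literals, or defined names), which is exactly the kind of shortcut that breaks on real workbooks.

```python
import re

# Simplified pattern: optional $ markers, column letters, row digits.
# Assumption: no sheet prefixes, string literals, or defined names appear.
CELL_REF = re.compile(r"(\$?)([A-Z]{1,3})(\$?)(\d+)")

def col_to_index(letters):
    """Convert column letters to a 1-based index (A -> 1, Z -> 26, AA -> 27)."""
    index = 0
    for ch in letters:
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index

def to_relative(formula, host):
    """Rewrite A1-style references in `formula` as offsets from cell `host`."""
    m = CELL_REF.fullmatch(host)
    host_col, host_row = col_to_index(m.group(2)), int(m.group(4))

    def repl(match):
        col_abs, letters, row_abs, digits = match.groups()
        col, row = col_to_index(letters), int(digits)
        # Absolute parts ($) keep their index; relative parts become offsets.
        row_part = "R%d" % row if row_abs else "R[%d]" % (row - host_row)
        col_part = "C%d" % col if col_abs else "C[%d]" % (col - host_col)
        return row_part + col_part

    return CELL_REF.sub(repl, formula)

formulas = {"B2": "=A2", "B3": "=A3+B2", "B4": "=A4+B3", "B5": "=A5+B4"}
relative = {cell: to_relative(f, cell) for cell, f in formulas.items()}
for cell, text in relative.items():
    print(cell, text)
```

Cells whose relative form is identical can then be grouped and ported as a single loop or vectorized expression instead of as dozens of separate formulas.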

独﹏钓一江月 2024-10-19 04:20:19

Update: I have gone and implemented a little library to do exactly what you describe: extracting the cells and dependencies from an Excel spreadsheet and converting them to Python code. The code is on GitHub, patches welcome :)


Just to add that you can always interact with Excel using win32com (not very fast, but it works). This does allow you to get the formula. A tutorial can be found here [cached copy] and details can be found in this chapter [cached copy].

Essentially you just do:

app.ActiveWorkbook.ActiveSheet.Cells(r,c).Formula

As for building a table of cell dependencies, a tricky thing is parsing the Excel expressions. If I remember correctly, the Trace code you mentioned does not always do this correctly. The best I have seen is the algorithm by E. W. Bachtal, for which a Python implementation is available that works well.
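
Once the formula strings are available, the dependency table from the question could be sketched roughly as below. This is not E. W. Bachtal's algorithm: it is a much simpler scanner that skips quoted strings and function names, drops sheet prefixes, and reports only the endpoints of a range. The names `TOKEN` and `references` are made up for illustration.

```python
import re

# Either a double-quoted string literal (skipped) or an A1-style reference.
# The lookarounds reject matches inside longer identifiers or function names.
TOKEN = re.compile(
    r'"(?:[^"]|"")*"'
    r'|(?<![A-Za-z0-9_.])\$?([A-Z]{1,3})\$?(\d+)(?![A-Za-z0-9_(])'
)

def references(formula):
    """Return the cell references in `formula`, in order, without duplicates."""
    refs = []
    for match in TOKEN.finditer(formula):
        if match.group(1) is None:
            continue  # the match was a string literal, not a reference
        ref = match.group(1) + match.group(2)
        if ref not in refs:
            refs.append(ref)
    return refs

formulas = {
    "A1": "=sqrt(B4)+C5*E7",
    "A2": "=sin(B5)*C6",
}
deps = {cell: references(text) for cell, text in formulas.items()}
for cell, refs in deps.items():
    print('%s depends on %s formula: "%s"' % (cell, ",".join(refs), formulas[cell]))
```

For a real workbook a proper tokenizer is still the safer choice, since this sketch knows nothing about defined names, structured references, or R1C1 notation.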

波浪屿的海角声 2024-10-19 04:20:19

So I know this is a very old post, but I found a decent way of getting the formulas from all the sheets in a workbook as well as having the newly created workbook retain all the formatting.

The first step is to save a copy of your .xlsx file as .xls, then use the .xls path as the file name in the code below. The .xlsx original must stay next to it, because the code unzips it.

This uses Python 2.7:

import os
import subprocess

from lxml import etree
from xlrd import open_workbook
from xlutils.copy import copy

# Path to the .xls copy; the .xlsx original is expected next to it
file_name = '<YOUR-FILE-HERE>'
dir_path = os.path.dirname(os.path.realpath(file_name))
xml_dir = os.path.join(dir_path, "file_xml")

# An .xlsx file is a zip archive, so unpack it to reach the worksheet XML
subprocess.call(["unzip", file_name + "x", "-d", xml_dir])

# Map each sheet name to its XML file, assuming the files are numbered in sheet order
xml_sheet_names = dict()
with open_workbook(file_name, formatting_info=True) as rb:
    wb = copy(rb)  # writable copy that keeps the formatting
    workbook_names_list = rb.sheet_names()
    for i, name in enumerate(workbook_names_list):
        xml_sheet_names[name] = "sheet" + str(i + 1)

sheet_formulas = dict()
for k in workbook_names_list:
    xml_file = os.path.join(xml_dir, "xl/worksheets/{}.xml".format(xml_sheet_names[k]))

    sheet_formulas[k] = dict()
    for _, elem in etree.iterparse(xml_file):
        # <f> elements hold the formula; their parent <c> carries the cell reference
        if elem.tag.split("}")[1] == 'f' and elem.text:
            cell_key = elem.getparent().get(key="r")
            sheet_formulas[k][cell_key] = str("=" + elem.text)

print(sheet_formulas)

Structure of Dictionary 'sheet_formulas'

{'Worksheet_Name': {'A1_cell_reference':'cell_formula'}}

Example results:

{u'CY16': {'A1': '=Data!B5',
  'B1': '=Data!B1',
  'B10': '=IFERROR(Data!B12,"")',
  'B11': '=IFERROR(SUM(B9:B10),"")',
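
For readers on Python 3, roughly the same idea can be sketched with the standard library alone, reading the worksheet XML straight out of the .xlsx archive without unzipping it to disk. This is an illustrative variant rather than the code above: `read_formulas` is a hypothetical name, the result is keyed by worksheet part name (for example `xl/worksheets/sheet1.xml`) instead of by sheet name, and it assumes the usual transitional SpreadsheetML namespace.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by worksheet parts in ordinary (transitional) .xlsx files
NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}

def read_formulas(xlsx):
    """Return {worksheet part name: {cell reference: formula text}}.

    `xlsx` is a path or a binary file-like object holding an .xlsx archive.
    """
    result = {}
    with zipfile.ZipFile(xlsx) as archive:
        for part in archive.namelist():
            if not (part.startswith("xl/worksheets/") and part.endswith(".xml")):
                continue
            root = ET.fromstring(archive.read(part))
            formulas = {}
            for cell in root.iterfind(".//m:c", NS):
                f = cell.find("m:f", NS)
                # Dependent cells of a shared formula carry an empty <f>
                # element, so cells without formula text are skipped here.
                if f is not None and f.text:
                    formulas[cell.get("r")] = "=" + f.text
            result[part] = formulas
    return result
```

Calling `read_formulas("book.xlsx")` (the file name is a placeholder) returns a nested dictionary similar to `sheet_formulas`. Mapping part names back to sheet names would additionally require reading `xl/workbook.xml` and its relationships file, which this sketch leaves out.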

怕倦 2024-10-19 04:20:19

It seems that what you want is currently impossible with xlrd. Have a look at this post for a detailed description of why the functionality you need is so difficult to implement.

Note that the development team does a great job of providing support in the python-excel Google group.

还在原地等你 2024-10-19 04:20:19

Yes! With win32com it works for me.

# Requires pywin32: python -m pip install pywin32
import win32com.client

excel = win32com.client.Dispatch("Excel.Application")

file = r'path Excel file'
wb = excel.Workbooks.Open(file)
sheet = wb.ActiveSheet

# Get the value
val = sheet.Cells(1, 1).Value
# Get the formula
formula = sheet.Cells(6, 2).Formula

情泪▽动烟 2024-10-19 04:20:19

You can use cell.data_type to find whether a cell contains a formula or not. If it returns 'f' that means it is a formula.

import os

import openpyxl

# folder_name and filename must point to your workbook
wb = openpyxl.load_workbook(os.path.join(folder_name, filename), data_only=False)
ws = wb['Sheet1']
all_rows = list(ws.rows)
print(f"Found {len(all_rows)} rows of data.")
print("\nFirst rows of data:")
for row in all_rows[:5]:
    print(row)

# data_type is 'f' for formula cells, 'n' for numbers, 's' for strings
for cell in all_rows[1]:
    print(cell.data_type, cell.value)

The result of the above code is:

Found 40 rows of data.

First rows of data:
(<Cell 'Sheet1'.A1>, <Cell 'Sheet1'.B1>, <Cell 'Sheet1'.C1>, <Cell 'Sheet1'.D1>, <Cell 'Sheet1'.E1>, <Cell 'Sheet1'.F1>, <Cell 'Sheet1'.G1>, <Cell 'Sheet1'.H1>)
(<Cell 'Sheet1'.A2>, <Cell 'Sheet1'.B2>, <Cell 'Sheet1'.C2>, <Cell 'Sheet1'.D2>, <Cell 'Sheet1'.E2>, <Cell 'Sheet1'.F2>, <Cell 'Sheet1'.G2>, <Cell 'Sheet1'.H2>)
(<Cell 'Sheet1'.A3>, <Cell 'Sheet1'.B3>, <Cell 'Sheet1'.C3>, <Cell 'Sheet1'.D3>, <Cell 'Sheet1'.E3>, <Cell 'Sheet1'.F3>, <Cell 'Sheet1'.G3>, <Cell 'Sheet1'.H3>)
(<Cell 'Sheet1'.A4>, <Cell 'Sheet1'.B4>, <Cell 'Sheet1'.C4>, <Cell 'Sheet1'.D4>, <Cell 'Sheet1'.E4>, <Cell 'Sheet1'.F4>, <Cell 'Sheet1'.G4>, <Cell 'Sheet1'.H4>)
(<Cell 'Sheet1'.A5>, <Cell 'Sheet1'.B5>, <Cell 'Sheet1'.C5>, <Cell 'Sheet1'.D5>, <Cell 'Sheet1'.E5>, <Cell 'Sheet1'.F5>, <Cell 'Sheet1'.G5>, <Cell 'Sheet1'.H5>)
s John Doe
n 30
s Male
s Manager
n 60000
f =IF(AND(C2="Male", D2="Manager"), E2*1.1,
 IF(AND(C2="Female", D2="Director"), E2*1.15,
 E2*1.05))
n 1
s MMaCODE1