Python 中的 DSL 用于表格分析和处理？

发布于 2024-12-26 13:10:49 字数 3006 浏览 5 评论 0原文

我的想法是创建一种用于表格分析的有限形式的语言。问题是是否已经有类似以下内容的东西，或者是否有更好的想法来做到这一点。关键要求是用户必须能够使用裸脚本文件，而无需生成完整的 Python 程序。脚本语言的 OTOH 元素在运行时编译，并且应该提供制定复杂条件和计算（主要是算术和字符串运算）的可能性。用该语言制定程序的声明性方式（见下文）禁止直接使用 Python 语法作为该语言的载体（请参阅 @PART 函数的含义）——至少我是这么认为的。有没有更好/更聪明/更优雅的方法来实现我的目标，而不是像我下面那样使用 Python 和笨拙的自定义语法的半解析混合物进行编程？

下面我尝试通过一个例子来阐明我的想法。输入表由另一个程序部分生成，在脚本解释开始时看起来像这样：

# First,Last,Department,Hourly Wage
[ ('Greg','Foo','HR',100),
  ('Judy','Bar','EE',51),
  ('Jake','Baz','HR',75),
  ('Lila','Bax','HR',49),
  ('Norm','Fob','EE',49) ]

以下是“脚本”文件本身。这将是生产系统中的一个文件。程序代码目前被放置为 Python 字符串数组 - 也许在最终版本中还不是这样。

# A program to produce per department the average hourly rate, separated for the higher and lower 50% of earners: 
[ "@SORT(2,-3)", 
     "@SET({max},@MAX({3}))",
     "@PART({2}!={^2} or {3}<{max}/2)",
        "@SET({dep},@FIRST({2}))",
        "@PRINT({dep},float(@SUM({3}))/@CNT({3}))"
]

我将尝试逐步解释该脚本应执行的操作：

"@SORT(2,-3)"

在第 2 列（升序）之后对表进行排序，然后在第 3 列（降序）之后对表进行排序。我们

[ ('Judy','Bar','EE',51),
  ('Norm','Fob','EE',49),
  ('Greg','Foo','HR',100),
  ('Jake','Baz','HR',75),
  ('Lila','Bax','HR',49),
]

"@SET({max},@MAX({3}))"

获取第 3 列的最大值并将其放入动态局部变量 max 中，

"@PART({2}!={^2} or {3}<{max}/2)"

这有点复杂。 @PART 对当前表进行分区通过评估每行的给定表达式并在该行前面切割（如果为真），将其分成几个子表。在这里，我们要在部门边界处进行剪切（第 2 列）。 {^2}是向上引用，表示第2列的元素从上一行开始。这种语法是必要的，因为我认为能够在更复杂的条件下对表进行分区比“行与 X 中的前一行不同”非常重要（想象一下您想要将表划分为 10k 收入类别）所以我需要 PART 参数中（受限）Python 表达式的表达能力。这也有寓意由于没有前驱，因此无法计算第一行的表达式，因此 PART 函数我们会简单地回顾一下它。在此函数之后，我们有以下表格：

[ ('Judy','Bar','EE',51) ] # Department EE

[ ('Norm','Fob','EE',49) ] # Norm Fob is in the same department but earns less than half of the maximum

[ ('Greg','Foo','HR',100), # New department HR
  ('Jake','Baz','HR',75) ]

[ ('Lila','Bax','HR',49) ] # HR dept. but less than half of the best earner

从这里开始，脚本中的函数将分别对每个子表起作用。或多或少有一个PART功能在所有生成的子表上启动循环，并且执行以下每个函数（包括更多部分）隔离的子表。

"@SET({dep},@FIRST({2}))",
"@PRINT({dep},float(@SUM({3}))/@CNT({3}))"

@FIRST({2}) 简单地获取第一行第 2 列的值。 @SUM({3}) 取整个第 3 列的总和 @CNT({3}) 正在计算第 3 列不等于 None 的行数。我大致在这里描绘了函数的结果：

[ ('Judy','Bar','EE',51) ]
        "@SET({dep},@FIRST({2}))"  --> {dep} = "EE"
        "@PRINT({dep},float(@SUM({3}))/@CNT({3}))"  --> output  "EE 51"

[ ('Norm','Fob','EE',49) ]
        "@SET({dep},@FIRST({2}))", --> {dep} = "EE"
        "@PRINT({dep},float(@SUM({3}))/@CNT({3}))"  --> output  "EE 49"

[ ('Greg','Foo','HR',100), 
  ('Jake','Baz','HR',75) ]
        "@SET({dep},@FIRST({2}))", --> {dep} = "HR"
        "@PRINT({dep},float(@SUM({3}))/@CNT({3}))"  --> output  "HR 87.5"

[ ('Lila','Bax','HR',49) ] 
        "@SET({dep},@FIRST({2}))", --> {dep} = "HR"
        "@PRINT({dep},float(@SUM({3}))/@CNT({3}))"  --> output  "HR 49"

我应该补充一点，我更喜欢保持解决方案较小，即不使用非标准 Python 包，如 pyparsing 等。

原文

I am carrying around the idea to create a restricted form of a language for tabular analysis. Thequestion is if there is something like the following already around or if there are better ideas todo it. The key requirement is that the user must be able to employ naked script files without thenecessity to produce a full Python program. OTOH elements of the script language are compiled at runtime andshould offer a possibility to formulate complex conditions and calculations (mostly arithmetic and string operations). The declarative way to formulate programs in the language (see below) prohibits the direct use of Python syntax as the vehicle for the language (see the implications of the @PART function) - at least I think so.
Is there a better/ more clever/ more elegant way to achieve my goals than to program around in a semi-parsed mixture of Python and clumsy self-defined syntax like I did below?

Below I try to make my ideas clear by an example. The input table is produced by another program part and looks like this on start of the script interpretation:

# First,Last,Department,Hourly Wage
[ ('Greg','Foo','HR',100),
  ('Judy','Bar','EE',51),
  ('Jake','Baz','HR',75),
  ('Lila','Bax','HR',49),
  ('Norm','Fob','EE',49) ]

Following is the 'script' file itself. This will be a file for itself in a production system. The program code is currently laid down as an array of Python strings - maybe not even that in a final version.

# A program to produce per department the average hourly rate, separated for the higher and lower 50% of earners: 
[ "@SORT(2,-3)", 
     "@SET({max},@MAX({3}))",
     "@PART({2}!={^2} or {3}<{max}/2)",
        "@SET({dep},@FIRST({2}))",
        "@PRINT({dep},float(@SUM({3}))/@CNT({3}))"
]

I will try to explain step by step what the script shall do:

"@SORT(2,-3)"

sorts the table after column 2 (ascending), then column 3 (descending). We get

[ ('Judy','Bar','EE',51),
  ('Norm','Fob','EE',49),
  ('Greg','Foo','HR',100),
  ('Jake','Baz','HR',75),
  ('Lila','Bax','HR',49),
]

"@SET({max},@MAX({3}))"

takes the maximum of column 3 and puts it into a dynamic, local variable max

"@PART({2}!={^2} or {3}<{max}/2)"

is a bit more complex. @PART partitions the current table
into several sub-tables by evaluating the given expression for each row and cutting in front of the row if true.
Here we want to cut at department-borders (column 2). {^2} is an upward reference, meaning the element in column 2
from the previous row. This syntax is necessary as I deem the ability to partition tables on conditions more complex
than "row differs from previous row in X" very important (imagine you want to partition the table in 10k-income classes)
so I need the expressive power of a (restricted) Python expression in the argument to PART. Also this has the implication
that the expression can not be evaluated for the first row as there is no predecessor, so the PART-function
will simply go over it.
After this function we have the following tables:

[ ('Judy','Bar','EE',51) ] # Department EE

[ ('Norm','Fob','EE',49) ] # Norm Fob is in the same department but earns less than half of the maximum

[ ('Greg','Foo','HR',100), # New department HR
  ('Jake','Baz','HR',75) ]

[ ('Lila','Bax','HR',49) ] # HR dept. but less than half of the best earner

From here on, the functions in the script will work on each sub-table separately. A PART function more or less
starts a loop over all resulting sub-tables and every following function (including more PARTs) is executed on
a subtable in isolation.

"@SET({dep},@FIRST({2}))",
"@PRINT({dep},float(@SUM({3}))/@CNT({3}))"

@FIRST({2}) takes the value of column 2 of the first row simply. @SUM({3}) takes the sum of the whole column 3
and @CNT({3}) is counting the number of rows which have column 3 unequal to None.
I picture the functions outcome roughly here:

[ ('Judy','Bar','EE',51) ]
        "@SET({dep},@FIRST({2}))"  --> {dep} = "EE"
        "@PRINT({dep},float(@SUM({3}))/@CNT({3}))"  --> output  "EE 51"

[ ('Norm','Fob','EE',49) ]
        "@SET({dep},@FIRST({2}))", --> {dep} = "EE"
        "@PRINT({dep},float(@SUM({3}))/@CNT({3}))"  --> output  "EE 49"

[ ('Greg','Foo','HR',100), 
  ('Jake','Baz','HR',75) ]
        "@SET({dep},@FIRST({2}))", --> {dep} = "HR"
        "@PRINT({dep},float(@SUM({3}))/@CNT({3}))"  --> output  "HR 87.5"

[ ('Lila','Bax','HR',49) ] 
        "@SET({dep},@FIRST({2}))", --> {dep} = "HR"
        "@PRINT({dep},float(@SUM({3}))/@CNT({3}))"  --> output  "HR 49"

I should add that I'd prefer to keep the solution small, i.e. no use of non-standard Python packages like pyparsing etc.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

橪书 2025-01-02 13:10:49

Littletable 是我编写的一个模块，用于对类似项目列表进行一些表格分析。 Littletable 不使用 SQL 进行选择和查询，但可以对字段进行索引，并且表可以进行联接、透视和查询。表格可以像 Python 列表一样对待。关于littletable的最大哲学点可能是每个连接、查询等都会返回一个新表，以便可以从中间连接和查询构建复杂的表达式。操作数据的一些示例：

attrs = "Id,First,Last,Department,Hourly_Wage".split(',') 
data = [ (1, 'Greg','Foo','HR',100), 
          (2, 'Judy','Bar','EE',51), 
          (3, 'Jake','Bar','HR',75), 
          (4, 'Lila','Bax','HR',49), 
          (5, 'Norm','Fob','EE',49) ] 

from littletable import Table, DataObject

instructors = Table()
instructors.insert_many(
    DataObject(**dict(zip(attrs,d))) 
        for d in data)

# can add index before or after items are added to table
instructors.create_index("Id", unique=True)
instructors.create_index("Department")

# unique keys are enforced
try:
    instructors.insert(DataObject(Id=4, First="Bob", Last="Fob"))
except KeyError as e:
    print e

# keys checked for uniqueness when creating unique index
try:
    instructors.create_index("Last", unique=True)
except KeyError as e:
    print e

# Uniquely indexed access gives a single record
print "%(First)s %(Last)s" % instructors.by.Id[3]

# Non-uniquely indexed access gives a new Table
print '\n'.join("%(Department)s %(First)s %(Last)s" % inst 
                    for inst in instructors.by.Department["HR"])

# Table can still be accessed like a Python list
print "%(First)s %(Last)s" % instructors[-1]
print '\n'.join("%(Department)s %(First)s %(Last)s" % inst 
                    for inst in instructors)

# use pivot for multi-dimensional grouping
instructors.addfield("wage_bracket", lambda d:d.Hourly_Wage/10*10)
instructors.create_index("wage_bracket")
instructors.pivot("Department wage_bracket").dump()
instructors.pivot("Department wage_bracket").dump_counts()

import sys
instructors.csv_export(sys.stdout)

以下是使用 littletable: prints

("duplicate unique key value '4' for index Id", {'Last': 'Fob', 'Id': 4, 'First': 'Bob'})
'duplicate key value Bar'
Jake Bar
HR Greg Foo
HR Jake Bar
HR Lila Bax
Norm Fob
HR Greg Foo
EE Judy Bar
HR Jake Bar
HR Lila Bax
EE Norm Fob
Pivot: Department,wage_bracket
  Department:EE
    Department:EE/wage_bracket:40
      {'Last': 'Fob', 'Hourly_Wage': 49, 'Department': 'EE', 'wage_bracket': 40, 'Id': 5, 'First': 'Norm'}
    Department:EE/wage_bracket:50
      {'Last': 'Bar', 'Hourly_Wage': 51, 'Department': 'EE', 'wage_bracket': 50, 'Id': 2, 'First': 'Judy'}
  Department:HR
    Department:HR/wage_bracket:40
      {'Last': 'Bax', 'Hourly_Wage': 49, 'Department': 'HR', 'wage_bracket': 40, 'Id': 4, 'First': 'Lila'}
    Department:HR/wage_bracket:70
      {'Last': 'Bar', 'Hourly_Wage': 75, 'Department': 'HR', 'wage_bracket': 70, 'Id': 3, 'First': 'Jake'}
    Department:HR/wage_bracket:100
      {'Last': 'Foo', 'Hourly_Wage': 100, 'Department': 'HR', 'wage_bracket': 100, 'Id': 1, 'First': 'Greg'}
Pivot: Department,wage_bracket
              40         50         70        100      Total
EE             1          1          0          0          2
HR             1          0          1          1          3
Total          2          1          1          1          5
Last,Hourly_Wage,Department,wage_bracket,Id,First
Foo,100,HR,100,1,Greg
Bar,51,EE,50,2,Judy
Bar,75,HR,70,3,Jake
Bax,49,HR,40,4,Lila
Fob,49,EE,40,5,Norm

littletable is a module that I wrote to do some of this tabular analysis of a list of like items. littletable does not use SQL for selects and queries, but fields can be indexed, and tables can do joins, pivots, and queries. Tables can be treated just like Python lists. Probably the biggest philosophical point about littletable is that every join, query, etc. returns a new Table, so that a complex expression can be built up from intermediate joins and queries. Here are some samples of manipulating your data with littletable:

attrs = "Id,First,Last,Department,Hourly_Wage".split(',') 
data = [ (1, 'Greg','Foo','HR',100), 
          (2, 'Judy','Bar','EE',51), 
          (3, 'Jake','Bar','HR',75), 
          (4, 'Lila','Bax','HR',49), 
          (5, 'Norm','Fob','EE',49) ] 

from littletable import Table, DataObject

instructors = Table()
instructors.insert_many(
    DataObject(**dict(zip(attrs,d))) 
        for d in data)

# can add index before or after items are added to table
instructors.create_index("Id", unique=True)
instructors.create_index("Department")

# unique keys are enforced
try:
    instructors.insert(DataObject(Id=4, First="Bob", Last="Fob"))
except KeyError as e:
    print e

# keys checked for uniqueness when creating unique index
try:
    instructors.create_index("Last", unique=True)
except KeyError as e:
    print e

# Uniquely indexed access gives a single record
print "%(First)s %(Last)s" % instructors.by.Id[3]

# Non-uniquely indexed access gives a new Table
print '\n'.join("%(Department)s %(First)s %(Last)s" % inst 
                    for inst in instructors.by.Department["HR"])

# Table can still be accessed like a Python list
print "%(First)s %(Last)s" % instructors[-1]
print '\n'.join("%(Department)s %(First)s %(Last)s" % inst 
                    for inst in instructors)

# use pivot for multi-dimensional grouping
instructors.addfield("wage_bracket", lambda d:d.Hourly_Wage/10*10)
instructors.create_index("wage_bracket")
instructors.pivot("Department wage_bracket").dump()
instructors.pivot("Department wage_bracket").dump_counts()

import sys
instructors.csv_export(sys.stdout)

prints:

("duplicate unique key value '4' for index Id", {'Last': 'Fob', 'Id': 4, 'First': 'Bob'})
'duplicate key value Bar'
Jake Bar
HR Greg Foo
HR Jake Bar
HR Lila Bax
Norm Fob
HR Greg Foo
EE Judy Bar
HR Jake Bar
HR Lila Bax
EE Norm Fob
Pivot: Department,wage_bracket
  Department:EE
    Department:EE/wage_bracket:40
      {'Last': 'Fob', 'Hourly_Wage': 49, 'Department': 'EE', 'wage_bracket': 40, 'Id': 5, 'First': 'Norm'}
    Department:EE/wage_bracket:50
      {'Last': 'Bar', 'Hourly_Wage': 51, 'Department': 'EE', 'wage_bracket': 50, 'Id': 2, 'First': 'Judy'}
  Department:HR
    Department:HR/wage_bracket:40
      {'Last': 'Bax', 'Hourly_Wage': 49, 'Department': 'HR', 'wage_bracket': 40, 'Id': 4, 'First': 'Lila'}
    Department:HR/wage_bracket:70
      {'Last': 'Bar', 'Hourly_Wage': 75, 'Department': 'HR', 'wage_bracket': 70, 'Id': 3, 'First': 'Jake'}
    Department:HR/wage_bracket:100
      {'Last': 'Foo', 'Hourly_Wage': 100, 'Department': 'HR', 'wage_bracket': 100, 'Id': 1, 'First': 'Greg'}
Pivot: Department,wage_bracket
              40         50         70        100      Total
EE             1          1          0          0          2
HR             1          0          1          1          3
Total          2          1          1          1          5
Last,Hourly_Wage,Department,wage_bracket,Id,First
Foo,100,HR,100,1,Greg
Bar,51,EE,50,2,Judy
Bar,75,HR,70,3,Jake
Bax,49,HR,40,4,Lila
Fob,49,EE,40,5,Norm

回复收藏 0 原文

~没有更多了~