编写领域特定语言来从表中选择行

发布于 2024-07-05 20:50:16 字数 662 浏览 10 评论 0原文

我正在编写一个服务器,我希望它由许多不同的人运行,但并不是所有我都会直接接触的人。 服务器将在集群中相互通信。 服务器的部分功能涉及从可能非常大的表中选择一小部分行。 选择哪些行的确切选择将需要一些调整,并且重要的是运行集群的人(例如我自己)可以更新选择标准,而无需让每个服务器管理员部署新版本的服务器。

简单地用 Python 编写函数并不是一个真正的选择,因为没有人会想要安装一个在运行时下载并执行任意 Python 代码的服务器。

我需要的是关于实现领域特定语言以实现此目标的最简单方法的建议。 该语言需要能够进行简单的表达式求值,以及查询表索引和迭代返回的行。 编写和阅读语言的难易程度仅次于实现语言的难易程度。 我也不想编写整个查询优化器,因此明确指定要查询的索引的东西将是理想的。

必须编译的接口在功能上与 App Engine 数据存储区导出的功能类似:您可以查询表上任何索引的顺序范围(例如,小于、大于、范围和相等查询) ,然后通过任何布尔表达式过滤返回的行。 您还可以将多个独立的结果集连接在一起。

我意识到这个问题听起来很像我在问 SQL。 但是,我不想要求支持此数据的数据存储是关系数据库,并且我不希望自己尝试重新实现 SQL 的开销。 我还只处理一个具有已知模式的表。 最后,不需要加入。 更简单的东西会更好。

编辑:扩展描述以消除一些误解。

I'm writing a server that I expect to be run by many different people, not all of whom I will have direct contact with. The servers will communicate with each other in a cluster. Part of the server's functionality involves selecting a small subset of rows from a potentially very large table. The exact choice of what rows are selected will need some tuning, and it's important that it's possible for the person running the cluster (eg, myself) to update the selection criteria without getting each and every server administrator to deploy a new version of the server.

Simply writing the function in Python isn't really an option, since nobody is going to want to install a server that downloads and executes arbitrary Python code at runtime.

What I need are suggestions on the simplest way to implement a Domain Specific Language to achieve this goal. The language needs to be capable of simple expression evaluation, as well as querying table indexes and iterating through the returned rows. Ease of writing and reading the language is secondary to ease of implementing it. I'd also prefer not to have to write an entire query optimiser, so something that explicitly specifies what indexes to query would be ideal.

The interface that this will have to compile against will be similar in capabilities to what the App Engine datastore exports: You can query for sequential ranges on any index on the table (eg, less-than, greater-than, range and equality queries), then filter the returned row by any boolean expression. You can also concatenate multiple independent result sets together.

I realise this question sounds a lot like I'm asking for SQL. However, I don't want to require that the datastore backing this data be a relational database, and I don't want the overhead of trying to reimplement SQL myself. I'm also dealing with only a single table with a known schema. Finally, no joins will be required. Something much simpler would be far preferable.

Edit: Expanded description to clear up some misconceptions.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(9

夜声 2024-07-12 20:50:16

听起来你想创建一个语法而不是 DSL。 我会研究 ANTLR ,它允许您创建一个特定的解析器来解释文本并转换为特定的命令。 ANTLR 提供了 Python、SQL、Java、C++、C、C# 等库。

另外,这里是 ANTLR 的一个很好的示例 计算引擎

It sounds like you want to create a grammar not a DSL. I'd look into ANTLR which will allow you to create a specific parser that will interpret text and translate to specific commands. ANTLR provides libraries for Python, SQL, Java, C++, C, C# etc.

Also, here is a fine example of an ANTLR calculation engine created in C#

刘备忘录 2024-07-12 20:50:16

构建由 Python 解释的 DSL。

步骤 1. 构建运行时类和对象。 这些类将所有游标循环和 SQL 语句以及所有算法处理隐藏在它们的方法中。 您将大量使用 命令构建这些类的策略设计模式。 大多数东西都是命令,选项和选择是插件策略。 看看 Apache Ant 的 Task API 的设计——这是一个很好的例子。

步骤 2. 验证该对象系统是否确实有效。 确保设计简单且完整。 您的测试将构造 Command 和 Strategy 对象,然后执行顶级 Command 对象。 命令对象将完成这项工作。

至此,您已基本完成。 您的运行时只是从上述域创建的对象的配置。 [这并不像听起来那么容易。 需要小心地定义一组可以实例化的类,然后“它们之间进行对话”来完成应用程序的工作。]

请注意,您所需要的只是声明。 程序上有什么问题吗? 当您开始使用过程元素编写 DSL 时,您会发现需要越来越多的功能,直到您使用不同的语法编写 Python。 不好。

此外,过程语言解释器很难编写。 执行状态和引用范围很难管理。

您可以使用本机 Python,而不必再担心“脱离沙箱”。 事实上,这就是您对所有内容进行单元测试的方式,使用简短的 Python 脚本来创建对象。 Python 将成为 DSL。

[“但是等等”,你说,“如果我简单地使用 Python 作为 DSL,人们就可以执行任意的事情。” 取决于 PYTHONPATH 和 sys.path 上的内容。 查看 site 模块,了解控制可用内容的方法。]

声明性 DSL最简单。 这完全是一种代表性的练习。 仅设置某些变量值的 Python 块很好。 这就是 Django 所使用的。

您可以使用 ConfigParser 作为表示运行时配置的语言对象。

您可以使用 JSONYAML 作为表示对象运行时配置的语言。 现成的解析器是完全可用的。

您也可以使用 XML。 设计和解析比较困难,但效果很好。 人们喜欢它。 这就是 Ant 和 Maven(以及许多其他工具)使用声明性语法来描述过程的方式。 我不推荐它,因为它是一个冗长的脖子痛。 我建议简单地使用Python。

或者,您可以深入研究并发明自己的语法并编写自己的解析器。

Building a DSL to be interpreted by Python.

Step 1. Build the run-time classes and objects. These classes will have all the cursor loops and SQL statements and all of that algorithmic processing tucked away in their methods. You'll make heavy use of the Command and Strategy design patterns to build these classes. Most things are a command, options and choices are plug-in strategies. Look at the design for Apache Ant's Task API -- it's a good example.

Step 2. Validate that this system of objects actually works. Be sure that the design is simple and complete. You're tests will construct the Command and Strategy objects, and then execute the top-level Command object. The Command objects will do the work.

At this point you're largely done. Your run-time is just a configuration of objects created from the above domain. [This isn't as easy as it sounds. It requires some care to define a set of classes that can be instantiated and then "talk among themselves" to do the work of your application.]

Note that what you'll have will require nothing more than declarations. What's wrong with procedural? One you start to write a DSL with procedural elements, you find that you need more and more features until you've written Python with different syntax. Not good.

Further, procedural language interpreters are simply hard to write. State of execution, and scope of references are simply hard to manage.

You can use native Python -- and stop worrying about "getting out of the sandbox". Indeed, that's how you'll unit test everything, using a short Python script to create your objects. Python will be the DSL.

["But wait", you say, "If I simply use Python as the DSL people can execute arbitrary things." Depends on what's on the PYTHONPATH, and sys.path. Look at the site module for ways to control what's available.]

A declarative DSL is simplest. It's entirely an exercise in representation. A block of Python that merely sets the values of some variables is nice. That's what Django uses.

You can use the ConfigParser as a language for representing your run-time configuration of objects.

You can use JSON or YAML as a language for representing your run-time configuration of objects. Ready-made parsers are totally available.

You can use XML, too. It's harder to design and parse, but it works fine. People love it. That's how Ant and Maven (and lots of other tools) use declarative syntax to describe procedures. I don't recommend it, because it's a wordy pain in the neck. I recommend simply using Python.

Or, you can go off the deep-end and invent your own syntax and write your own parser.

流年已逝 2024-07-12 20:50:16

我想我们需要更多的信息。 如果以下任何内容基于错误的假设,请告诉我。

首先,正如您自己指出的,已经存在一种用于从任意表中选择行的 DSL——它称为“SQL”。 由于您不想重新发明 SQL,因此我假设您只需要从具有固定格式的单个表进行查询。

如果是这种情况,您可能不需要实现 DSL(尽管这肯定是一种方法); 如果您习惯于面向对象,那么创建 Filter 对象可能会更容易。

更具体地说,是一个包含一个或多个 SelectionCriterion 对象的“Filter”集合。 您可以实现这些以从表示选择类型的一个或多个基类(Range、LessThan、ExactMatch、Like 等)继承。一旦这些基类就位,您就可以创建适合该列的特定于列的继承版本。 最后,根据您想要支持的查询的复杂性,您将需要实现某种连接胶来处理各种条件之间的 AND、OR 和 NOT 链接。

如果您愿意,您可以创建一个简单的 GUI 来加载集合; 如果您没有其他想法,我会将 Excel 中的过滤视为模型。

最后,将此 Collection 的内容转换为相应的 SQL,并将其传递到数据库应该很简单。

但是:如果您追求的是简单性,并且您的用户了解 SQL,您可以简单地要求他们输入 WHERE 子句的内容,并以编程方式构建查询的其余部分。 从安全角度来看,如果您的代码可以控制所选的列和 FROM 子句,并且您的数据库权限设置正确,并且您对来自用户的字符串进行了一些健全性检查,那么这将是一个相对安全的选项。

I think we're going to need a bit more information here. Let me know if any of the following is based on incorrect assumptions.

First of all, as you pointed out yourself, there already exists a DSL for selecting rows from arbitrary tables-- it is called "SQL". Since you don't want to reinvent SQL, I'm assuming that you only need to query from a single table with a fixed format.

If this is the case, you probably don't need to implement a DSL (although that's certainly one way to go); it may be easier, if you are used to Object Orientation, to create a Filter object.

More specifically, a "Filter" collection that would hold one or more SelectionCriterion objects. You can implement these to inherit from one or more base classes representing types of selections (Range, LessThan, ExactMatch, Like, etc.) Once these base classes are in place, you can create column-specific inherited versions which are appropriate to that column. Finally, depending on the complexity of the queries you want to support, you'll want to implement some kind of connective glue to handle AND and OR and NOT linkages between the various criteria.

If you feel like it, you can create a simple GUI to load up the collection; I'd look at the filtering in Excel as a model, if you don't have anything else in mind.

Finally, it should be trivial to convert the contents of this Collection to the corresponding SQL, and pass that to the database.

However: if what you are after is simplicity, and your users understand SQL, you could simply ask them to type in the contents of a WHERE clause, and programmatically build up the rest of the query. From a security perspective, if your code has control over the columns selected and the FROM clause, and your database permissions are set properly, and you do some sanity checking on the string coming in from the users, this would be a relatively safe option.

半衬遮猫 2024-07-12 20:50:16

“实现领域特定语言”

“没有人会想要安装在运行时下载并执行任意 Python 代码的服务器”

我想要一个 DSL,但我不希望 Python 成为那个 DSL。 好的。 您将如何执行此 DSL? 如果不是 Python,什么运行时可以接受?

如果我有一个 C 程序恰好嵌入了 Python 解释器怎么办? 这是可以接受的吗?

而且——如果 Python 不是可接受的运行时——为什么它有一个 Python 标签?

"implement a Domain Specific Language"

"nobody is going to want to install a server that downloads and executes arbitrary Python code at runtime"

I want a DSL but I don't want Python to be that DSL. Okay. How will you execute this DSL? What runtime is acceptable if not Python?

What if I have a C program that happens to embed the Python interpreter? Is that acceptable?

And -- if Python is not an acceptable runtime -- why does this have a Python tag?

冬天的雪花 2024-07-12 20:50:16

为什么不创建一种语言,在“编译”时生成 SQL 或数据存储区所需的任何查询语言?

您基本上会在持久层上创建一个抽象。

Why not create a language that when it "compiles" it generates SQL or whatever query language your datastore requires ?

You would be basically creating an abstraction over your persistence layer.

新一帅帅 2024-07-12 20:50:16

你提到了Python。 为什么不使用Python? 如果有人可以在您的 DSL 中“输入”表达式,那么他们就可以输入 Python。

您需要一些关于表达式结构的规则,但这比实现新的东西容易得多。

You mentioned Python. Why not use Python? If someone can "type in" an expression in your DSL, they can type in Python.

You'll need some rules on structure of the expression, but that's a lot easier than implementing something new.

活泼老夫 2024-07-12 20:50:16

您说没有人会想要安装在运行时下载并执行任意代码的服务器。 然而,这正是您的 DSL(最终)要做的事情,因此可能没有太大区别。 除非你正在对数据做一些非常具体的事情,否则我认为 DSL 不会给你带来那么多好处,而且它会让已经精通 SQL 的用户感到沮丧。 不要低估您将要承担的任务的规模。

然而,要回答你的问题,你需要为你的语言想出一个语法,一些东西来解析文本并遍历树,发出代码或调用你编写的API(这就是为什么我评论说你是仍然需要发送一些代码)。

网上有很多关于数学表达式语法的教育文本,您可以参考,这相当简单。 您可能有像 ANTLR 或 Yacc 这样的解析器生成器工具,可以用来帮助您生成解析器(或者使用像 Lisp/Scheme 这样的语言并将两者结合起来)。 提出合理的 SQL 语法并不容易。 但是谷歌一下“BNF SQL”,看看你能得到什么。

祝你好运。

You said nobody is going to want to install a server that downloads and executes arbitrary code at runtime. However, that is exactly what your DSL will do (eventually) so there probably isn't that much of a difference. Unless you're doing something very specific with the data then I don't think a DSL will buy you that much and it will frustrate the users who are already versed in SQL. Don't underestimate the size of the task you'll be taking on.

To answer your question however, you will need to come up with a grammar for your language, something to parse the text and walk the tree, emitting code or calling an API that you've written (which is why my comment that you're still going to have to ship some code).

There are plenty of educational texts on grammars for mathematical expressions you can refer to on the net, that's fairly straight forward. You may have a parser generator tool like ANTLR or Yacc you can use to help you generate the parser (or use a language like Lisp/Scheme and marry the two up). Coming up with a reasonable SQL grammar won't be easy. But google 'BNF SQL' and see what you come up with.

Best of luck.

痴梦一场 2024-07-12 20:50:16

这听起来确实像 SQL,但如果您想保持简单,也许值得尝试使用 SQLite?

It really sounds like SQL, but perhaps it's worth to try using SQLite if you want to keep it simple?

拒绝两难 2024-07-12 20:50:16

上下文无关语法通常具有树状结构,函数式程序也具有树状结构。 我并不认为以下内容可以解决您的所有问题,但如果您确定不想使用像 SQLite3 这样的东西,那么这是朝着这个方向迈出的良好一步。

from functools import partial
def select_keys(keys, from_):
    return ({k : fun(v, row) for k, (v, fun) in keys.items()}
            for row in from_)

def select_where(from_, where):
    return (row for row in from_
            if where(row))

def default_keys_transform(keys, transform=lambda v, row: row[v]):
    return {k : (k, transform) for k in keys}

def select(keys=None, from_=None, where=None):
    """
    SELECT v1 AS k1, 2*v2 AS k2 FROM table WHERE v1 = a AND v2 >= b OR v3 = c

    translates to 

    select(dict(k1=(v1, lambda v1, r: r[v1]), k2=(v2, lambda v2, r: 2*r[v2])
        , from_=table
        , where= lambda r : r[v1] = a and r[v2] >= b or r[v3] = c)
    """
    assert from_ is not None
    idfunc = lambda k, t : t
    select_k = idfunc if keys is None  else select_keys
    if isinstance(keys, list):
        keys = default_keys_transform(keys)
    idfunc = lambda t, w : t
    select_w = idfunc if where is None else select_where
    return select_k(keys, select_w(from_, where))

如何确保不让用户执行任意代码。 该框架允许所有可能的功能。 好吧,为了安全起见,您可以在它上面设置一个包装器,公开可接受的函数对象的固定列表。

ALLOWED_FUNCS = [ operator.mul, operator.add, ...] # List of allowed funcs

def select_secure(keys=None, from_=None, where=None):
    if keys is not None and isinstance(keys, dict):
       for v, fun keys.values:
           assert fun in ALLOWED_FUNCS
    if where is not None:
       assert_composition_of_allowed_funcs(where, ALLOWED_FUNCS)
    return select(keys=keys, from_=from_, where=where)

如何编写 assert_composition_of_allowed_funcs。 在 python 中很难做到这一点,但在 lisp 中很容易做到。 让我们假设 where 是要以类似格式进行计算的函数列表,即 where=(operator.add, (operator.getitem, row, v1), 2)where =(operator.mul, (operator.add, (opreator.getitem, row, v2), 2), 3)

这使得编写一个 apply_lisp 函数成为可能,该函数确保 where 函数仅由 ALLOWED_FUNCS 或 float、int、str 等常量组成。

def apply_lisp(where, rowsym, rowval, ALLOWED_FUNCS):
    assert where[0] in ALLOWED_FUNCS
    return apply(where[0],
          [ (apply_lisp(w, rowsym, rowval, ALLOWED_FUNCS)
            if isinstance(w, tuple)
            else rowval if w is rowsym
            else w if isinstance(w, (float, int, str))
            else None ) for w in where[1:] ])

除此之外,您还需要检查确切的类型,因为您不希望您的类型被覆盖。 所以不要使用isinstance,而是使用type in (float, int, str)。 天哪,我们遇到了:

格林斯潘的第十条编程规则:任何足够复杂的规则
C 或 Fortran 程序包含一个临时的非正式指定的
Common Lisp 一半的错误缠身,执行缓慢。

A context-free grammar usually has a tree like structure and functional programs have a tree like structure too. I don't claim the following would solve all of your problems, but it is a good step in the direction if you are sure that you don't want to use something like SQLite3.

from functools import partial
def select_keys(keys, from_):
    return ({k : fun(v, row) for k, (v, fun) in keys.items()}
            for row in from_)

def select_where(from_, where):
    return (row for row in from_
            if where(row))

def default_keys_transform(keys, transform=lambda v, row: row[v]):
    return {k : (k, transform) for k in keys}

def select(keys=None, from_=None, where=None):
    """
    SELECT v1 AS k1, 2*v2 AS k2 FROM table WHERE v1 = a AND v2 >= b OR v3 = c

    translates to 

    select(dict(k1=(v1, lambda v1, r: r[v1]), k2=(v2, lambda v2, r: 2*r[v2])
        , from_=table
        , where= lambda r : r[v1] = a and r[v2] >= b or r[v3] = c)
    """
    assert from_ is not None
    idfunc = lambda k, t : t
    select_k = idfunc if keys is None  else select_keys
    if isinstance(keys, list):
        keys = default_keys_transform(keys)
    idfunc = lambda t, w : t
    select_w = idfunc if where is None else select_where
    return select_k(keys, select_w(from_, where))

How do you make sure that you are not giving users ability to execute arbitrary code. This framework admits all possible functions. Well, you can right a wrapper over it for security that expose a fixed list of function objects that are acceptable.

ALLOWED_FUNCS = [ operator.mul, operator.add, ...] # List of allowed funcs

def select_secure(keys=None, from_=None, where=None):
    if keys is not None and isinstance(keys, dict):
       for v, fun keys.values:
           assert fun in ALLOWED_FUNCS
    if where is not None:
       assert_composition_of_allowed_funcs(where, ALLOWED_FUNCS)
    return select(keys=keys, from_=from_, where=where)

How to write assert_composition_of_allowed_funcs. It is very difficult to do that it in python but easy in lisp. Let us assume that where is a list of functions to be evaluated in a lips like format i.e. where=(operator.add, (operator.getitem, row, v1), 2) or where=(operator.mul, (operator.add, (opreator.getitem, row, v2), 2), 3).

This makes it possible to write a apply_lisp function that makes sure that the where function is only made up of ALLOWED_FUNCS or constants like float, int, str.

def apply_lisp(where, rowsym, rowval, ALLOWED_FUNCS):
    assert where[0] in ALLOWED_FUNCS
    return apply(where[0],
          [ (apply_lisp(w, rowsym, rowval, ALLOWED_FUNCS)
            if isinstance(w, tuple)
            else rowval if w is rowsym
            else w if isinstance(w, (float, int, str))
            else None ) for w in where[1:] ])

Aside, you will also need to check for exact types, because you do not want your types to be overridden. So do not use isinstance, use type in (float, int, str). Oh boy we have run into:

Greenspun's Tenth Rule of Programming: any sufficiently complicated
C or Fortran program contains an ad hoc informally-specified
bug-ridden slow implementation of half of Common Lisp.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文