How can this non-deterministic XML Schema be rewritten as a deterministic one?
Why is this non-deterministic, and how can it be fixed?
<xs:element name="activeyears">
  <xs:complexType>
    <xs:sequence minOccurs="0" maxOccurs="1">
      <xs:sequence minOccurs="0" maxOccurs="unbounded">
        <xs:element ref="from" minOccurs="1" maxOccurs="1"/>
        <xs:element ref="till" minOccurs="1" maxOccurs="1"/>
      </xs:sequence>
      <xs:element ref="from" minOccurs="0" maxOccurs="1"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
It is supposed to mean that <activeyears> is either empty or contains a sequence of alternating <from> and <till> elements, which starts with <from> but can end with either.
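The intended rule can also be written as a plain check outside the schema. The following is a minimal sketch, assuming the elements carry no namespace; the function name `is_valid_activeyears` is hypothetical and not part of the original question.

```python
import xml.etree.ElementTree as ET


def is_valid_activeyears(xml_text: str) -> bool:
    """Return True if <activeyears> is empty or alternates from, till, from, ..."""
    root = ET.fromstring(xml_text)
    if root.tag != "activeyears":
        return False
    expected = ("from", "till")  # even positions hold <from>, odd positions <till>
    for index, child in enumerate(root):
        if child.tag != expected[index % 2]:
            return False
    return True


print(is_valid_activeyears("<activeyears/>"))
print(is_valid_activeyears("<activeyears><from/><till/><from/></activeyears>"))
print(is_valid_activeyears("<activeyears><till/></activeyears>"))
```

Because the check only compares each child against the element expected at its position, it never needs to know whether the current element is the last one.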
Comments (2)
A schema is non-deterministic when there are two branches that begin with the same element, so that you cannot tell which branch to take without looking ahead past that element. A simple example is ab|ac: when you see an a, you don't know which branch to take. For loops, the "branch" is whether to repeat the loop or continue after it. An example of this is a*a: once you are in the loop and you read an a, you don't know whether to repeat the loop or continue.
Looking at your example schema, imagine that it has just parsed a <till>, and now it needs to parse a <from>. You could parse it with the <from><till> loop or with the final <from>. You can't tell which branch to use just by looking at that <from>; you can only tell by looking further ahead.
Bad news: I think your example schema is one of the very rare cases that is impossible to express deterministically!
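The "two branches begin with the same element" condition can be checked mechanically. Below is a rough sketch using position sets in the style of a Glushkov automaton. The tuple encoding of content models and the names `analyse` and `is_deterministic` are assumptions made for illustration, with a standing for <from> and b for <till>. It does not handle numeric minOccurs/maxOccurs ranges, and it is not the exact algorithm of any particular schema processor.

```python
from itertools import count


def analyse(node, ids, symbol_of, follow):
    """Return (nullable, first, last) position sets for a content-model node."""
    kind = node[0]
    if kind == "sym":
        pos = next(ids)              # every occurrence gets its own position
        symbol_of[pos] = node[1]
        follow[pos] = set()
        return False, {pos}, {pos}
    if kind in ("star", "opt"):
        _, first, last = analyse(node[1], ids, symbol_of, follow)
        if kind == "star":           # the end of the loop body may restart it
            for pos in last:
                follow[pos] |= first
        return True, first, last
    n1, f1, l1 = analyse(node[1], ids, symbol_of, follow)
    n2, f2, l2 = analyse(node[2], ids, symbol_of, follow)
    if kind == "alt":
        return n1 or n2, f1 | f2, l1 | l2
    for pos in l1:                   # kind == "seq": left part flows into right
        follow[pos] |= f2
    first = f1 | f2 if n1 else f1
    last = l1 | l2 if n2 else l2
    return n1 and n2, first, last


def is_deterministic(model):
    """True if no choice point offers two positions with the same element name."""
    symbol_of, follow = {}, {}
    _, first, _ = analyse(model, count(), symbol_of, follow)
    for candidates in [first, *follow.values()]:
        names = [symbol_of[pos] for pos in candidates]
        if len(names) != len(set(names)):
            return False
    return True


A, B, C = ("sym", "a"), ("sym", "b"), ("sym", "c")
AB_OR_AC = ("alt", ("seq", A, B), ("seq", A, C))         # ab|ac
A_STAR_A = ("seq", ("star", A), A)                       # a*a
QUESTION = ("seq", ("star", ("seq", A, B)), ("opt", A))  # (ab)*a?
FACTORED = ("seq", A, ("alt", B, C))                     # a(b|c)

for label, model in [("ab|ac", AB_OR_AC), ("a*a", A_STAR_A),
                     ("(ab)*a?", QUESTION), ("a(b|c)", FACTORED)]:
    print(label, is_deterministic(model))
```

For ab|ac the conflict can be removed by factoring it as a(b|c), which the check accepts. The content model from the question has no such rewrite.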
Here are the XML documents you want to accept (I'm using a single letter for each element, where a = <from>...</from> and b = <till>...</till>): ... you get the idea. The problem is that any letter can be the final letter in the sequence, or it can be part of the loop. There is no way to tell which it will be except by looking ahead at the following letter. Since "deterministic" means (by definition) that you don't do this lookahead, the language that you want cannot be expressed deterministically.
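To make the set of accepted documents concrete, here is a small sketch that enumerates them by brute force. It assumes the intended content model can be written as the regular expression (ab)*a?, with a for <from> and b for <till>; the helper name `accepted_documents` is made up for this example.

```python
import re
from itertools import product

CONTENT_MODEL = re.compile(r"(ab)*a?")  # assumed encoding: a = from, b = till


def accepted_documents(max_length):
    """All accepted letter strings up to max_length, shortest first."""
    found = []
    for length in range(max_length + 1):
        for letters in product("ab", repeat=length):
            word = "".join(letters)
            if CONTENT_MODEL.fullmatch(word):
                found.append(word)
    return found


docs = accepted_documents(5)
print(docs)  # ['', 'a', 'ab', 'aba', 'abab', 'ababa']

# Every accepted document is a prefix of the next longer one, so the same
# letter is sometimes the last element and sometimes followed by more.
print(all(longer.startswith(shorter) for shorter, longer in zip(docs, docs[1:])))
```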
Simplifying your schema, it tries an approach similar to (ab)*a?, but both branches start with a. Another approach is a(ba)*b?, but now both branches start with b. We can't win!
Technically, the set of all documents that a schema will accept is called that schema's language. If no deterministic schema exists that can express a language, the language is called "one-ambiguous" (in Brüggemann-Klein's terminology, it is not one-unambiguous).
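As a quick sanity check, the two simplified expressions can be compared with ordinary regular expressions. This is only a sketch; the names `FIRST`, `SECOND` and `words` are illustrative. Python's `re` engine backtracks, so it accepts both expressions without complaint. The determinism restriction comes from XML Schema, not from the language being described.

```python
import re
from itertools import product

FIRST = re.compile(r"(ab)*a?")    # loop first, optional trailing a
SECOND = re.compile(r"a(ba)*b?")  # leading a, loop, optional trailing b


def words(max_length):
    """Yield every string over {a, b} up to max_length, shortest first."""
    for length in range(max_length + 1):
        for letters in product("ab", repeat=length):
            yield "".join(letters)


differences = [
    word for word in words(6)
    if bool(FIRST.fullmatch(word)) != bool(SECOND.fullmatch(word))
]
print(differences)  # ['']
```

Up to length 6, the only difference is the empty document, which the second form would need an extra optional wrapper to accept. Rearranging the expression moves the conflict from a to b but does not remove it.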
For a theoretical discussion, see the series of papers by Brüggemann-Klein (e.g. "Deterministic Regular Languages" and "One-Unambiguous Regular Languages"). She includes a formal test for one-unambiguous languages.
This is a simple edit of your code; I haven't tried it:
Some background: XML Schema is a very simple grammar, and the schema processor is a parser that attempts to apply the rules of this grammar to the input file. Unlike the parsers used by traditional compilers, however, a schema processor has no lookahead. So you can't have two rules that share the same initial set of tokens (element names).
So, the specific changes that I made:
sequence unchanged; it controls the "empty or has specific content" requirement.
element in the sequence, with explicit occurrence count.
minOccurs in the subsequence.
minOccurs='0' allowed a terminating sequence of two "till"s.