How can this non-deterministic XML Schema be rewritten as a deterministic one?

Posted on 2024-08-16 09:40:10 · 771 characters · 15 views · 0 comments

Why is this schema non-deterministic, and how can I fix it?

    <xs:element name="activeyears">
        <xs:complexType>
            <xs:sequence minOccurs="0" maxOccurs="1">
                <xs:sequence minOccurs="0" maxOccurs="unbounded">
                    <xs:element ref="from" minOccurs="1" maxOccurs="1"/>
                    <xs:element ref="till" minOccurs="1" maxOccurs="1"/>
                </xs:sequence>
                <xs:element ref="from" minOccurs="0" maxOccurs="1"/>
            </xs:sequence>
        </xs:complexType>
    </xs:element>

It is supposed to mean that <activeyears> is either empty or contains a sequence of <from><till> pairs that starts with <from> but can end with either element.

Comments (2)

人│生佛魔见 2024-08-23 09:40:10

A schema is non-deterministic when there are two branches that begin with the same element - so that you cannot tell which branch to take without looking ahead after that element. A simple example is ab|ac - when you see an a, you don't know which branch to take. For loops, the "branch" is whether to repeat the loop, or continue after it. An example of this is a*a - once you are in the loop, and you read an a, you don't know whether to repeat the loop, or continue.
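To make the ab|ac case concrete, here is a minimal sketch in XSD. The element names a, b and c and the type names Ambiguous and Factored are hypothetical, chosen only to mirror the letters above. The first type has two branches that both begin with a, which is what the Unique Particle Attribution rule of XML Schema 1.0 forbids; the second describes the same documents as a(b|c), so the choice is made only after a has been read.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical element declarations, for illustration only -->
  <xs:element name="a" type="xs:string"/>
  <xs:element name="b" type="xs:string"/>
  <xs:element name="c" type="xs:string"/>

  <!-- ab|ac: both branches start with <a>, so a validator that enforces
       Unique Particle Attribution is expected to reject this type -->
  <xs:complexType name="Ambiguous">
    <xs:choice>
      <xs:sequence>
        <xs:element ref="a"/>
        <xs:element ref="b"/>
      </xs:sequence>
      <xs:sequence>
        <xs:element ref="a"/>
        <xs:element ref="c"/>
      </xs:sequence>
    </xs:choice>
  </xs:complexType>

  <!-- a(b|c): same documents, but the branch is chosen after <a> -->
  <xs:complexType name="Factored">
    <xs:sequence>
      <xs:element ref="a"/>
      <xs:choice>
        <xs:element ref="b"/>
        <xs:element ref="c"/>
      </xs:choice>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
```

Both types are shown side by side for comparison; in practice the Ambiguous type would have to be removed before the schema loads in a strict processor.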

Looking at your example schema, imagine that it has just parsed a <till>, and now it needs to parse a <from>. You could parse it with the <from><till> loop or with the final <from>. You can't tell which branch to use just by looking at that <from>. You can only tell by looking further ahead.


Bad news: I think your example schema is one of the very rare ones that is impossible to express deterministically!

Here are the XML documents you want to accept (I'm using a single letter for each element, where a = <from>...</from> and b = <till>...</till>):

*empty*
a
ab
aba
abab
ababa
ababab
...

... you get the idea. The problem is that any letter can be the final letter in the sequence or it can be part of the loop. There is no way to tell which it will be, except by looking-ahead at the following letter. Since "deterministic" means that you don't do this lookahead (by definition), the language that you want cannot be expressed deterministically.

Simplifying your schema, it tries an approach similar to (ab)*a? - but both branches start with a. Another approach is a(ba)*b? - now both branches start with b. We can't win!

Technically, the set of all documents that a schema will accept is called that schema's language. A language that some deterministic schema can express is called "one-unambiguous"; if no deterministic schema exists that can express a language, the language is not one-unambiguous.

For a theoretical discussion, see the series of papers by Bruggemann-Klein (e.g. "Deterministic Regular Languages" and "One-Unambiguous Regular Languages").
She includes a formal test for one-unambiguous languages.
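The impossibility applies to the unbounded language. If an upper bound on the number of elements is acceptable (an assumption that the question does not make), the alternation can be unrolled into nested optional sequences, and that form is deterministic because each position has only one possible next element. The following is a sketch for at most two from/till pairs; the xs:gYear types for from and till are assumed, since the question does not show their declarations.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Assumed declarations; the question does not show how from/till are typed -->
  <xs:element name="from" type="xs:gYear"/>
  <xs:element name="till" type="xs:gYear"/>

  <xs:element name="activeyears">
    <xs:complexType>
      <!-- (from (till (from till?)?)?)? : accepts empty, a, ab, aba, abab -->
      <xs:sequence minOccurs="0">
        <xs:element ref="from"/>
        <xs:sequence minOccurs="0">
          <xs:element ref="till"/>
          <xs:sequence minOccurs="0">
            <xs:element ref="from"/>
            <xs:element ref="till" minOccurs="0"/>
          </xs:sequence>
        </xs:sequence>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

Each extra allowed element needs one more level of nesting, so this only helps when the bound is small.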

深空失忆 2024-08-23 09:40:10

This is a simple edit of your code; I haven't tried it:

    <xs:element name="activeyears">
        <xs:complexType>
            <xs:sequence minOccurs="0" maxOccurs="1">
                <xs:element ref="from" minOccurs="1" maxOccurs="1"/>
                <xs:sequence minOccurs="0" maxOccurs="unbounded">
                    <xs:element ref="till" minOccurs="1" maxOccurs="1"/>
                    <xs:element ref="from" minOccurs="0" maxOccurs="1"/>
                </xs:sequence>
            </xs:sequence>
        </xs:complexType>
    </xs:element>

Some background: an XML Schema content model is a very simple grammar, and the schema processor is a parser that attempts to apply the rules of this grammar to the input file. Unlike the parsers used by traditional compilers, however, a schema processor has no lookahead. So you can't have two rules that begin with the same token (element name).

So, the specific changes that I made:

  • I left your outer sequence unchanged; it controls the "empty or has specific content" requirement.
  • If there is content, it must start with "from"; so I made that the first element in the sequence, with an explicit occurrence count.
  • Since I used "from" as an explicit element, I had to reverse the order of the subsequence.
  • And unless you want to specify that every "till" must be followed by a "from", you need to relax the minOccurs in the subsequence.
  • The subsequence also handles the case of a single from/till pair. As a commenter noted, though, my second edit (the minOccurs='0') also allows a terminating sequence of two "till"s.
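To see what the revised content model, in effect (from (till from?)*)?, accepts, here are a few instance fragments. The examples wrapper element and the year values are placeholders invented for illustration; the comments describe the expected result from reading the content model and have not been run through a validator.

```xml
<!-- Hypothetical wrapper, used only so the fragments form one well-formed document -->
<examples>
  <!-- expected to be accepted: empty -->
  <activeyears/>

  <!-- expected to be accepted: a single from -->
  <activeyears><from>1990</from></activeyears>

  <!-- expected to be accepted: from, till, from -->
  <activeyears><from>1990</from><till>1995</till><from>1998</from></activeyears>

  <!-- also accepted, although the question did not intend it: two tills in a row -->
  <activeyears><from>1990</from><till>1995</till><till>1999</till></activeyears>

  <!-- expected to be rejected: content must start with from -->
  <activeyears><till>1995</till></activeyears>
</examples>
```

The fourth fragment is the trade-off: the model is deterministic because it accepts slightly more than the language the question asked for.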