Are XHTML entity encodings valid in an XML document as long as they are contained within CDATA markers?

Posted 2024-07-15 12:51:16

Is this a valid (well-formed) XML document?

<?xml version="1.0" encoding="UTF-8" ?> 
<outer>
  <inner>&copy;</inner>
</outer>

At issue is whether the HTML/XHTML "&copy;" entity encoding is valid in an XML document where there is no DTD or schema to define it. An alternative way of expressing the above would be to say this:

<?xml version="1.0" encoding="UTF-8" ?> 
<outer>
  <inner>&#169;</inner>
</outer>

Which would seem to be valid XML with a UTF-8 encoding.

But is this valid:

<?xml version="1.0" encoding="UTF-8" ?> 
<outer>
  <inner><![CDATA[&copy;]]></inner>
</outer>

The author of the above intends to indicate to the XML parser that it should pass through the copyright symbol above as the string "&copy;" rather than as a proper Unicode character.

In that respect I find this quote a little confusing: "New authors of XML documents often misunderstand the purpose of a CDATA section, mistakenly believing that its purpose is to 'protect' data from being treated as ordinary character data during processing. [But] Character data is character data, regardless of whether it is expressed via a CDATA section or ordinary markup." (From Wikipedia)

I am separately looking at a proposed XML format from a second author who has wrapped the content of every tag in CDATA sections, even when the tag can, for example, only contain digits.

Hope an XML guru can help clear up the confusion on the purpose of CDATA.

Thanks!

爱，才寂寞 2024-07-22 12:51:16

A CDATA section is for the purpose of allowing literal text that would normally be interpreted in a special way in an XML document. That is, something that looks like an entity reference, or something that looks like XML tags. Anything in a CDATA section can be inside valid XML without a CDATA section; you'll just need to use entity references to encode the various special characters so they won't be treated as XML markup, but as character data that is the value of a tag.

So yes, the following is perfectly valid, as long as it is what you intend:

<?xml version="1.0" encoding="UTF-8" ?> 
<outer>
  <inner><![CDATA[&copy;]]></inner>
</outer>

Here, the value of the inner element is the literal string &copy;, which the XML parser will not interpret as an entity reference for the copyright symbol. You can also do the following:

<?xml version="1.0" encoding="UTF-8" ?> 
<outer>
  <inner><![CDATA[<normally> this looks <like/> & xml </normally>]]></inner>
</outer>

where the value for the inner element is

<normally> this looks <like/> & xml </normally>

To do this without a CDATA section:

<?xml version="1.0" encoding="UTF-8" ?> 
<outer>
  <inner>&lt;normally&gt; this looks &lt;like/&gt; &amp; xml &lt;/normally&gt;</inner>
</outer>

which is much less human-readable, but equivalent as far as an XML parser is concerned. If you did this (assuming that the inner element is defined in a schema or DTD as containing a string and not XML) then a validating XML parser will complain:

<?xml version="1.0" encoding="UTF-8" ?> 
<outer>
  <inner><normally> this looks <like/> &amp; xml </normally></inner>
</outer>

so you use the CDATA or entity escaping to protect the special characters from the XML parser so the client of the XML data can get the value of inner which happens to contain XML markup characters.

Note: To be clear, the above example is well-formed XML as long as the ampersand is escaped as &amp;, but if the schema or DTD says that the element inner contains xsd:string or equivalent, then it is an invalid XML document.
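
The equivalence between the CDATA form and the entity-escaped form can be checked with a few lines of code. This is only a minimal sketch using Python's standard xml.etree.ElementTree; the choice of Python and the variable names are illustrative assumptions, not part of the original answer.

```python
import xml.etree.ElementTree as ET

# The same payload, once inside a CDATA section and once entity-escaped.
cdata_doc = (
    "<outer><inner><![CDATA[<normally> this looks <like/> & xml </normally>]]>"
    "</inner></outer>"
)
escaped_doc = (
    "<outer><inner>&lt;normally&gt; this looks &lt;like/&gt; &amp; xml "
    "&lt;/normally&gt;</inner></outer>"
)

cdata_text = ET.fromstring(cdata_doc).find("inner").text
escaped_text = ET.fromstring(escaped_doc).find("inner").text

# Both forms hand the application the same character data.
print(cdata_text == escaped_text)
print(cdata_text)
```

Once parsing is done, the application cannot tell which of the two forms the document used, which is the point the Wikipedia quote is making.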

And no, HTML or XHTML entities that are not predefined by XML itself (only lt, gt, amp, apos and quot are) are not valid in an XML document unless they are declared, for example in a DTD. Without a declaration, your XML parser will return an error.
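
To see this behaviour for yourself, something like the following should do. It is a sketch with Python's standard xml.etree.ElementTree; the helper name try_parse is hypothetical and only exists for this illustration.

```python
import xml.etree.ElementTree as ET

def try_parse(doc: str) -> str:
    """Return the text of <inner>, or a short description of the parse error."""
    try:
        return ET.fromstring(doc).find("inner").text
    except ET.ParseError as err:
        return f"ParseError: {err}"

# &copy; is an HTML entity; nothing declares it here, so parsing fails.
undeclared = try_parse("<outer><inner>&copy;</inner></outer>")

# A numeric character reference needs no declaration.
numeric = try_parse("<outer><inner>&#169;</inner></outer>")

# The five entities that XML itself predefines.
predefined = try_parse("<outer><inner>&lt;&gt;&amp;&apos;&quot;</inner></outer>")

print(undeclared)
print(repr(numeric))
print(repr(predefined))
```

The exact wording of the error message depends on the parser, so the sketch only relies on the fact that an error is raised.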


同展鸳鸯锦 2024-07-22 12:51:16

Eddie gave a good reply; I will just add some points that he apparently did not mention.

<?xml version="1.0" encoding="UTF-8" ?> 
<outer>
  <inner>&copy;</inner>
</outer>

is not legal (entity "copy" is not predefined; only "lt", "gt", "amp",
"apos" and "quot" are, in XML).

<?xml version="1.0" encoding="UTF-8" ?> 
<outer>
  <inner>&#169;</inner>
</outer>

is perfectly legal and probably gives what you want (a copyright
symbol).

<?xml version="1.0" encoding="UTF-8" ?> 
<outer>
  <inner><![CDATA[&copy;]]></inner>
</outer>

is also perfectly legal but yields a quite different result (the
element <inner> will contain six Unicode characters, instead of one in
the previous example).
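
The "six characters versus one" difference is easy to confirm. A minimal sketch, assuming Python's standard xml.etree.ElementTree (the variable names are mine):

```python
import xml.etree.ElementTree as ET

# Numeric character reference: the parser resolves it to one character.
char_ref = ET.fromstring(
    "<outer><inner>&#169;</inner></outer>"
).find("inner").text

# CDATA section: the parser passes the six characters through untouched.
cdata = ET.fromstring(
    "<outer><inner><![CDATA[&copy;]]></inner></outer>"
).find("inner").text

print(len(char_ref), repr(char_ref))
print(len(cdata), repr(cdata))
```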

<?xml version="1.0" encoding="UTF-8" ?> 
<!DOCTYPE outer[
<!ENTITY copy "&#169;">
]>
<outer>
  <inner>&copy;</inner>
</outer>

is legal, too, and gives the same result as the second example. It can
save you from typing some characters that you use but are not easy to
generate with your keyboard/editor.
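
As a rough check that a declaration in the internal DTD subset is enough, here is a sketch with Python's standard xml.etree.ElementTree. I am assuming the declaration maps copy to the character reference &#169;; some hardened parser configurations deliberately refuse entity declarations, so results may differ elsewhere.

```python
import xml.etree.ElementTree as ET

# The internal DTD subset declares the entity "copy" before it is used.
doc = (
    "<!DOCTYPE outer [\n"
    '<!ENTITY copy "&#169;">\n'
    "]>\n"
    "<outer><inner>&copy;</inner></outer>"
)

declared_text = ET.fromstring(doc).find("inner").text

# The reference is expanded to the single copyright character.
print(repr(declared_text))
```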

<?xml version="1.0" encoding="UTF-8" ?> 
<outer>
  <inner>©</inner>
</outer>

is legal, too (because the encoding is UTF-8; with encoding="US-ASCII" it
would have been impossible), and gives the same result, provided that your
keyboard/editor lets you enter this character directly.
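
The dependence on the declared encoding can be sketched as follows, again with Python's standard xml.etree.ElementTree. The documents are passed as bytes so that the parser actually honours the encoding declaration; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# \u00a9 is the literal copyright character, written directly in the text.
body = "<outer><inner>\u00a9</inner></outer>"

# Declared as UTF-8 and encoded as UTF-8: the character is accepted.
utf8_doc = ('<?xml version="1.0" encoding="UTF-8"?>' + body).encode("utf-8")
utf8_text = ET.fromstring(utf8_doc).find("inner").text

# Same bytes, but the declaration claims US-ASCII, which has no such character.
ascii_doc = ('<?xml version="1.0" encoding="US-ASCII"?>' + body).encode("utf-8")
try:
    ET.fromstring(ascii_doc)
    ascii_ok = True
except ET.ParseError:
    ascii_ok = False

print(repr(utf8_text))
print(ascii_ok)
```

With a US-ASCII declaration, the numeric reference &#169; remains available, since it uses only ASCII characters in the file itself.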


月下客 2024-07-22 12:51:16

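The content of a CDATA block is not parsed for markup by the XML parser, so as far as validation and parsability go you can put anything into the CDATA (anything except the terminator ]]>).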

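Of course, this also means the CDATA content is treated as arbitrary text, so if you want an actual © in your XML, this will not work. We assume you plan to load the content of the CDATA into an X/HTML parser, much as you might load a blob of Base64-encoded binary data from an image into an image parser. The XML parser will not try to derive meaning from the content of the CDATA block; as far as it is concerned, the block might as well say "foo" as &copy;.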
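
The two-stage handling described here might look like the sketch below. It uses Python's standard xml.etree.ElementTree for the XML layer and html.unescape as a stand-in for "an X/HTML parser"; both choices and the sample text are assumptions made for illustration.

```python
import html
import xml.etree.ElementTree as ET

doc = "<outer><inner><![CDATA[&copy; 2024]]></inner></outer>"

# XML layer: the CDATA content comes through as opaque text.
raw = ET.fromstring(doc).find("inner").text

# HTML layer: a second, separate step resolves the HTML entity.
decoded = html.unescape(raw)

print(repr(raw))
print(repr(decoded))
```

The XML parser never learns what &copy; means; only the consumer that knows the payload is HTML can turn it into the copyright character.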

The Wikipedia quote seems confusingly worded.
