Efficient way to encode CDATA elements

Posted 2024-07-13 18:52:52, 567 characters, 5 views, 0 comments

OK, I'm reading data from a stream using a StreamReader. The data inside the stream is not XML; it could be anything.

Based on the input StreamReader, I'm writing to an output stream using an XmlTextWriter. When all is said and done, the output stream contains the data from the input stream wrapped in an element, which is itself contained in a parent element.

My problem is twofold. Data gets read from the input stream in chunks, and the StreamReader class hands it back as a char[]. If the data in the input stream contains "]]>", it needs to be split across two CDATA sections. First, how do I search for "]]>" in a char array? Second, because I'm reading in chunks, the "]]>" substring could be split across two chunks, so how do I account for this?

I could probably convert the char[] to a string and do a search-and-replace on it. That would solve my first problem. On each read, I could also check whether the last character was a "]", so that on the next read, if the first two characters are "]>", I would start a new CDATA section.

This hardly seems efficient, because it involves converting the char array to a string, which means spending time copying the data and using twice the memory. Is there a more efficient way, in terms of both speed and memory?

Comments (3)

无畏 2024-07-20 18:52:52

According to HOWTO Avoid Being Called a Bozo When Producing XML:

Don’t bother with CDATA sections

XML provides two ways of escaping markup-significant characters: predefined entities and CDATA sections. CDATA sections are only syntactic sugar. The two alternative syntactic constructs have no semantic difference.

CDATA sections are convenient when you are editing XML manually and need to paste a large chunk of text that includes markup-significant characters (eg. code samples). However, when producing XML using a serializer, the serializer takes care of escaping automatically and trying to micromanage the choice of escaping method only opens up possibilities for bugs.
...
Only <, >, & and (in attribute values) " need escaping.

So long as that small set of special characters is encoded/escaped, it should just work.

Whether you have to handle the escaping yourself is a different matter, but certainly a much more straightforward-to-solve problem.

Then just append the whole lot as a child text node to the relevant XML element.

染墨丶若流云 2024-07-20 18:52:52

I know of exactly two real use cases for CDATA:

One is in an XHTML document containing script:

<script type="text/javascript">
<![CDATA[
   function foo()
   {
      alert("You don't want <this> text escaped.");
   }
]]>
</script>

The other is in hand-authored XML documents where the text contains embedded markup, e.g.:

<p>
   A typical XML element looks like this:
</p>
<p>
   <pre>
   <![CDATA[
      <sample>
         <text>
            I'm using CDATA here so that I don't have to manually escape
            all of the special characters in this example.
         </text>
      </sample>
   ]]>
   </pre>
</p>

In all other cases, just letting the DOM (or the XmlWriter, or whatever tool you're using to create the XML) escape the text nodes works just fine.
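To illustrate that last point outside .NET, here is a minimal sketch that uses Python's standard xml.etree.ElementTree as a stand-in for "whatever tool you're using". The element name data and the sample text are made up for the example; the point is only that the text is assigned unescaped and the serializer takes care of the rest.

```python
import xml.etree.ElementTree as ET

# Sample text with markup-significant characters, including a "]]>" sequence.
raw = 'if (a < b && c > d) { print("x]]>y"); }'

elem = ET.Element("data")   # hypothetical element name
elem.text = raw             # assigned as-is, no manual escaping
xml_text = ET.tostring(elem, encoding="unicode")

# The serializer escaped the text, and parsing restores it unchanged.
assert ET.fromstring(xml_text).text == raw
print(xml_text)
```

No CDATA section is involved, and the round trip through a parser still returns the original text.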

执笔绘流年 2024-07-20 18:52:52

second, because I'm reading in chunks, the "]]>" substring could be split across two chunks, so how do I account for this?

Indeed, you would have to hold back the last two characters in a queue instead of writing them out immediately. When new input comes in, append it to the queue, search-and-replace over the whole queue, and then output everything except the last two characters again. (If you searched only the part you are about to output, a "]]>" that straddles the cut would still be missed.)
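If you do want CDATA sections, the hold-back idea might look roughly like the sketch below. It is written in Python to show the logic only, not the .NET API, and the name cdata_stream is invented for the example. This variant holds back only trailing "]" characters (at most two), which is enough because a "]]>" can only straddle a chunk boundary if the chunk ends in "]" or "]]".

```python
CDATA_OPEN = "<![CDATA["
CDATA_CLOSE = "]]>"
# "]]" stays in the current section, the section is closed, and a new
# section is opened that starts with the ">".
SPLIT = "]]" + CDATA_CLOSE + CDATA_OPEN + ">"


def cdata_stream(chunks):
    """Yield output pieces that wrap the chunked text in CDATA sections."""
    held = ""  # trailing "]" characters that might begin a "]]>"
    yield CDATA_OPEN
    for chunk in chunks:
        text = held + chunk
        # Count the trailing "]" characters, at most two.
        hold = 0
        while hold < 2 and hold < len(text) and text[-1 - hold] == "]":
            hold += 1
        body, held = text[:len(text) - hold], text[len(text) - hold:]
        if body:
            yield body.replace(CDATA_CLOSE, SPLIT)
    # Whatever is still held back is ordinary text; close the last section.
    yield held + CDATA_CLOSE


# "]]>" is split across the two chunks here.
print("".join(cdata_stream(["a]", "]>b"])))  # <![CDATA[a]]]]><![CDATA[>b]]>
```

A parser reads the two sections back as "a]]" followed by ">b", so the original text is preserved.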

Better: don't bother with CDATA sections at all. They're only there for the convenience of hand-authoring. If you're already doing search-and-replace, there's no reason you shouldn't just replace ‘<’, ‘>’ and ‘&’ with their predefined entities (&lt;, &gt; and &amp;) and include the result in a normal Text node. Since those are simple single-character replacements, you don't need to worry about buffering.
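A minimal sketch of that per-chunk replacement, again in Python purely for illustration (escape_chunk is a made-up name). Each character is mapped independently, so no state has to be carried from one chunk to the next.

```python
# Map each markup-significant character to its predefined entity.
ENTITIES = str.maketrans({"<": "&lt;", ">": "&gt;", "&": "&amp;"})


def escape_chunk(chunk):
    """Escape one chunk of text for use inside an XML text node."""
    return chunk.translate(ENTITIES)


# The "]]>" split across the chunks needs no special handling,
# because the ">" is escaped on its own.
chunks = ["a]", "]>b & <c>"]
print("".join(escape_chunk(c) for c in chunks))  # a]]&gt;b &amp; &lt;c&gt;
```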

But: if you're using an XmlTextWriter as you say, it's as simple as calling WriteString() on it for each chunk of incoming text.
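WriteString() is the .NET call. As a rough analog in another standard library, the sketch below feeds fixed-size chunks to Python's xml.sax.saxutils.XMLGenerator, whose characters() method escapes each chunk as it is written. The element names parent and data, the helper names and the chunk size are assumptions made for the example.

```python
import io
from xml.sax.saxutils import XMLGenerator


def read_chunks(reader, size):
    """Yield successive blocks of up to `size` characters from a text stream."""
    return iter(lambda: reader.read(size), "")


def wrap_stream(reader, out, size=4):
    """Copy `reader` into `out` as <parent><data>...</data></parent>."""
    writer = XMLGenerator(out, encoding="utf-8")
    writer.startElement("parent", {})
    writer.startElement("data", {})
    for chunk in read_chunks(reader, size):
        writer.characters(chunk)  # escapes &, < and > in this chunk
    writer.endElement("data")
    writer.endElement("parent")


out = io.StringIO()
wrap_stream(io.StringIO("a]]>b & <c>"), out)
print(out.getvalue())
```

The chunk boundaries never matter here, because the writer escapes character by character and no "]]>" ever reaches the output unescaped.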
