XDocument.Save() removes my &#xA; entities

Posted on 2024-12-26 07:31:16

I wrote a tool to repair some XML files (i.e., insert some attributes/values that were missing) using C# and Linq-to-XML. The tool loads an existing XML file into an XDocument object. Then, it walks down through the nodes to insert the missing data. After that, it calls XDocument.Save() to save the changes out to another directory.

All of that is just fine except for one thing: any &#xA; entities that are in the text in the XML file are replaced with a new line character. The entity represents a new line, of course, but I need to preserve the entity in the XML because another consumer needs it in there.

Is there any way to save the modified XDocument without losing the &#xA; entities?

Thank you.

幸福%小乖 2025-01-02 07:31:16


The &#xA; entities are technically called “numeric character references” in XML, and they are resolved when the original document is loaded into the XDocument. This makes your issue problematic to solve, since there is no way of distinguishing resolved whitespace entities from insignificant whitespace (typically used for formatting XML documents for plain-text viewers) after the XDocument has been loaded. Thus, the below only applies if your document does not have any insignificant whitespace.
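To see the resolution happen, here is a minimal sketch. The `ResolutionDemo` class and the sample XML string are made up for illustration; only `XDocument.Parse` and `XElement.Value` come from LINQ to XML. It parses a one-element document and inspects the text held in memory:

```csharp
using System;
using System.Xml.Linq;

public static class ResolutionDemo
{
    // Returns the in-memory text of the root element after parsing.
    public static string LoadText(string xml)
    {
        return XDocument.Parse(xml).Root.Value;
    }

    public static void Main()
    {
        string text = LoadText("<root>a&#xA;b</root>");

        // The reference is already gone: the text holds a real line feed,
        // and nothing records that it was once written as &#xA;.
        Console.WriteLine(text.Contains("\n"));     // True
        Console.WriteLine(text.Contains("&#xA;"));  // False
    }
}
```

Once the document is in this state, a line feed that came from `&#xA;` looks exactly like one that was typed literally in the file.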

The System.Xml library allows one to preserve whitespace entities by setting the NewLineHandling property of the XmlWriterSettings class to Entitize. However, within text nodes, this would only entitize \r to &#xD;, and not \n to &#xA;.
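For comparison, this is roughly what the built-in setting looks like in use. It is a sketch only: the `EntitizeDemo` class and the sample text are invented for illustration, and the output is collected in a `StringBuilder` so it can be inspected.

```csharp
using System;
using System.Text;
using System.Xml;
using System.Xml.Linq;

public static class EntitizeDemo
{
    // Serializes the document with NewLineHandling.Entitize and returns the markup.
    public static string Serialize(XDocument document)
    {
        var settings = new XmlWriterSettings
        {
            NewLineHandling = NewLineHandling.Entitize,
            OmitXmlDeclaration = true
        };

        var builder = new StringBuilder();
        using (XmlWriter writer = XmlWriter.Create(builder, settings))
            document.Save(writer);
        return builder.ToString();
    }

    public static void Main()
    {
        // The element text contains both a carriage return and a line feed.
        var document = new XDocument(new XElement("root", "line1\r\nline2"));
        Console.WriteLine(Serialize(document));
    }
}
```

With this setting I would expect the carriage return to come out as `&#xD;` while the line feed is written literally, which is why it does not solve the `&#xA;` case on its own.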

The easiest solution is to derive from the XmlWriter class and override its WriteString method to manually replace the whitespace characters with their numeric character entities. The WriteString method also happens to be the place where .NET entitizes characters that are not permitted to appear in text nodes, such as the syntax markers &, <, and >, which are respectively entitized to &amp;, &lt;, and &gt;.

Since XmlWriter is abstract, we shall derive from XmlTextWriter in order to avoid having to implement all the abstract methods of the former class. Here is a quick-and-dirty implementation:

public class EntitizingXmlWriter : XmlTextWriter
{
    public EntitizingXmlWriter(TextWriter writer) :
        base(writer)
    { }

    public override void WriteString(string text)
    {
        foreach (char c in text)
        {
            switch (c)
            {
                case '\r':
                case '\n':
                case '\t':
                    base.WriteCharEntity(c);
                    break;
                default:
                    base.WriteString(c.ToString());
                    break;
            }
        }
    }
}

If intended for use in a production environment, you’d want to do away with the c.ToString() part, since it’s very inefficient. You can optimize the code by batching substrings of the original text that do not contain any of the characters you want to entitize, and feeding them together into a single base.WriteString call.

A word of warning: The following naive implementation will not work, since the base WriteString method would replace any & characters with &amp;, thereby causing \r to be expanded to &amp;#xD; (and likewise \n to &amp;#xA;).

    public override void WriteString(string text)
    {
        text = text.Replace("\r", "&#xD;");
        text = text.Replace("\n", "&#xA;");
        text = text.Replace("\t", "&#x9;");
        base.WriteString(text);
    }

Finally, to save your XDocument into a destination file or stream, just use the following snippet:

using (var textWriter = new StreamWriter(destination))
using (var xmlWriter = new EntitizingXmlWriter(textWriter))
    document.Save(xmlWriter);
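
As a quick sanity check, the whole round trip can be sketched in memory. This is an illustration under a few assumptions: `RoundTripDemo` and the sample XML are hypothetical, a `StringWriter` stands in for the destination file, and the `WriteString` override is a compact `IndexOfAny` variation on the writer described above rather than the exact code from this answer.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;

public class EntitizingXmlWriter : XmlTextWriter
{
    private static readonly char[] Entitized = { '\r', '\n', '\t' };

    public EntitizingXmlWriter(TextWriter writer) : base(writer) { }

    public override void WriteString(string text)
    {
        int start = 0;
        int hit;

        // Write each run of ordinary characters in one call, then the
        // whitespace character that ended the run as a character reference.
        while ((hit = text.IndexOfAny(Entitized, start)) >= 0)
        {
            if (hit > start)
                base.WriteString(text.Substring(start, hit - start));
            base.WriteCharEntity(text[hit]);
            start = hit + 1;
        }

        if (start < text.Length)
            base.WriteString(text.Substring(start));
    }
}

public static class RoundTripDemo
{
    // Parses the XML, saves it through the entitizing writer, returns the markup.
    public static string RoundTrip(string xml)
    {
        XDocument document = XDocument.Parse(xml);
        using (var textWriter = new StringWriter())
        {
            using (var xmlWriter = new EntitizingXmlWriter(textWriter))
                document.Save(xmlWriter);
            return textWriter.ToString();
        }
    }

    public static void Main()
    {
        Console.WriteLine(RoundTrip("<root>a&#xA;b</root>"));
    }
}
```

The saved markup should again contain `a&#xA;b` inside the root element. Keep in mind that any literal line feed in a text node is entitized too, which is the limitation noted at the start of this answer.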

Hope this helps!

Edit: For reference, here is an optimized version of the overridden WriteString method:

public override void WriteString(string text)
{
    // The start index of the next substring containing only non-entitized characters.
    int start = 0;

    // The index of the current character being checked.
    for (int curr = 0; curr < text.Length; ++curr)
    {
        // Check whether the current character should be entitized.
        char chr = text[curr];
        if (chr == '\r' || chr == '\n' || chr == '\t')
        {
            // Write the previous substring of non-entitized characters.
            if (start < curr)
                base.WriteString(text.Substring(start, curr - start));

            // Write current character, entitized.
            base.WriteCharEntity(chr);

            // Next substring of non-entitized characters tentatively starts
            // immediately beyond current character.
            start = curr + 1;
        }
    }

    // Write the trailing substring of non-entitized characters.
    if (start < text.Length)
        base.WriteString(text.Substring(start, text.Length - start));
}
电影里的梦 2025-01-02 07:31:16

If your document contains insignificant whitespace which you want to distinguish from your &#xA; entities, you can use the following (much simpler) solution: Convert the &#xA; character references temporarily to another character (that is not already present in your document), perform your XML processing, and then convert the character back in the output result. In the example below, we shall use the private character U+E800.

static string ProcessXml(string input)
{
    input = input.Replace("&#xA;", "&#xE800;");
    XDocument document = XDocument.Parse(input);
    // TODO: Perform XML processing here.
    string output = document.ToString();
    return output.Replace("\uE800", "&#xA;");
}

Note that, since XDocument resolves numeric character references to their corresponding Unicode characters, the "&#xE800;" entities would have been resolved to '\uE800' in the output.

Typically, you can safely use any code point from Unicode’s “Private Use Area” (U+E000 to U+F8FF). If you want to be extra safe, check whether the character is already present in the document; if it is, pick another character from that range. Since you’ll only be using the character temporarily and internally, it does not matter which one you use. In the very unlikely scenario that all private use characters are already present in the document, throw an exception; however, I doubt that that will ever happen in practice.
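
The extra-safe check could look something like the sketch below. `PlaceholderDemo` and `FindUnusedPrivateUseChar` are hypothetical names introduced here, not part of any library. The sketch assumes the references in the input are spelled exactly `&#xA;`; other spellings such as `&#10;` or `&#xa;` are not handled, and decimal references to a private use character are not detected.

```csharp
using System;
using System.Xml.Linq;

public static class PlaceholderDemo
{
    // Hypothetical helper: returns the first Private Use Area character that
    // appears in the input neither literally nor as a hex character reference.
    public static char FindUnusedPrivateUseChar(string input)
    {
        for (char c = '\uE000'; c <= '\uF8FF'; c++)
        {
            string reference = "&#x" + ((int)c).ToString("X") + ";";
            bool literal = input.IndexOf(c) >= 0;
            bool referenced =
                input.IndexOf(reference, StringComparison.OrdinalIgnoreCase) >= 0;
            if (!literal && !referenced)
                return c;
        }
        throw new InvalidOperationException(
            "Every private use character is already present in the document.");
    }

    public static string ProcessXml(string input)
    {
        char placeholder = FindUnusedPrivateUseChar(input);
        string reference = "&#x" + ((int)placeholder).ToString("X") + ";";

        // Swap the line feed references for the placeholder before parsing.
        input = input.Replace("&#xA;", reference);
        XDocument document = XDocument.Parse(input);
        // TODO: Perform XML processing here.
        string output = document.ToString();

        // Swap the placeholder back into line feed references.
        return output.Replace(placeholder.ToString(), "&#xA;");
    }

    public static void Main()
    {
        Console.WriteLine(ProcessXml("<root>a&#xA;b</root>"));
    }
}
```

Scanning the range from the start costs one pass over the input per candidate, but in practice the very first candidate is almost always free.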
