How to remove duplicate attributes from XML using C#

Posted on 2024-11-19 01:09:01

I am parsing XML files from a third-party provider, and unfortunately they are not always well-formed XML: sometimes an element contains duplicate attributes.

I don't have control over the source, and I don't know in advance which elements may have duplicate attributes or what the duplicate attribute names will be.

Obviously, loading the content into an XmlDocument object raises an XmlException on the duplicate attributes, so I thought I could use an XmlReader to step through the XML element by element and deal with the duplicate attributes when I get to the offending element.

However, the XmlException is raised on reader.Read(), before I get a chance to inspect the element's attributes.

Here's a sample method to demonstrate the issue:

public static void ParseTest()
{
    const string xmlString = 
        @"<?xml version='1.0'?>
        <!-- This is a sample XML document -->
        <Items dupattr=""10"" id=""20"" dupattr=""33"">
            <Item>test with a child element <more/> stuff</Item>
        </Items>";

    var output = new StringBuilder();
    using (XmlReader reader = XmlReader.Create(new StringReader(xmlString)))
    {
        XmlWriterSettings ws = new XmlWriterSettings();
        ws.Indent = true;
        using (XmlWriter writer = XmlWriter.Create(output, ws))
        {
            while (reader.Read())   /* Exception thrown here when the Items element is encountered */
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        writer.WriteStartElement(reader.Name);
                        if (reader.HasAttributes){ /* CopyNonDuplicateAttributes(); */}
                        break;
                    case XmlNodeType.Text:
                        writer.WriteString(reader.Value);
                        break;
                    case XmlNodeType.XmlDeclaration:
                    case XmlNodeType.ProcessingInstruction:
                        writer.WriteProcessingInstruction(reader.Name, reader.Value);
                        break;
                    case XmlNodeType.Comment:
                        writer.WriteComment(reader.Value);
                        break;
                    case XmlNodeType.EndElement:
                        writer.WriteFullEndElement();
                        break;
                }
            }

        }
    }
    string str = output.ToString();
}

Is there another way to parse the input and remove the duplicate attributes without having to use regular expressions and string manipulation?


Comments (2)

十年不长 2024-11-26 01:09:01

I found a solution by treating the XML as an HTML document. Using the open-source Html Agility Pack library, I was then able to get valid XML.

The trick is to save the XML with an HTML header first.
So replace the XML declaration
<?xml version="1.0" encoding="utf-8" ?>
with an HTML declaration like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

Once the contents are saved to a file, this method will return a valid XML document.

// Requires a reference to HtmlAgilityPack
// Namespaces used: System.IO, System.Xml, HtmlAgilityPack
public XmlDocument LoadHtmlAsXml(string url)
{
    var web = new HtmlWeb();

    using (var m = new MemoryStream())
    {
        var xtw = new XmlTextWriter(m, null);

        // Load the content into the writer
        web.LoadHtmlAsXml(url, xtw);

        // Push anything still buffered in the writer into the stream
        xtw.Flush();

        // Rewind the memory stream
        m.Position = 0;

        // Create, fill, and return the xml document
        XmlDocument xmlDoc = new XmlDocument();
        xmlDoc.LoadXml(new StreamReader(m).ReadToEnd());
        return xmlDoc;
    }
}

The duplicate attribute nodes are automatically removed with the later attribute values overwriting the earlier ones.
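
The declaration swap described in this answer is not shown in code, so here is a minimal sketch of that step using only the base class library. The class name DeclarationSwap and the method ReplaceXmlDeclaration are hypothetical names introduced for illustration. The sketch assumes the XML declaration, if present, is the first thing in the text.

```csharp
using System;

public static class DeclarationSwap
{
    // The DOCTYPE line quoted in the answer above.
    private const string HtmlDoctype =
        "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" " +
        "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">";

    // Hypothetical helper: drops a leading <?xml ... ?> declaration (if any)
    // and puts the HTML DOCTYPE in its place.
    public static string ReplaceXmlDeclaration(string xml)
    {
        string body = xml.TrimStart();

        // Only treat "<?xml" followed by whitespace as the declaration,
        // so a processing instruction such as <?xml-stylesheet ...?> is left alone.
        bool hasDeclaration =
            body.StartsWith("<?xml", StringComparison.Ordinal) &&
            body.Length > 5 &&
            char.IsWhiteSpace(body[5]);

        if (hasDeclaration)
        {
            int end = body.IndexOf("?>", StringComparison.Ordinal);
            if (end >= 0)
            {
                body = body.Substring(end + 2);
            }
        }

        return HtmlDoctype + Environment.NewLine + body.TrimStart();
    }
}
```

The returned string could then be written to disk, for example with System.IO.File.WriteAllText, before being handed to the method above. The file path is up to the caller.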

悲歌长辞 2024-11-26 01:09:01

OK, I think you need to catch the error:

Then you should be able to use the following methods:

reader.MoveToFirstAttribute();

and

reader.MoveToNextAttribute()

to get the following properties:

reader.Value
reader.Name

This will enable you to get all the attribute values.
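
As a rough sketch of the calls named above, the following walks the attributes of the first element with MoveToFirstAttribute() and MoveToNextAttribute(), and wraps the read in a try/catch for XmlException. The class and method names (AttributeWalk, ReadRootAttributes, TryReadRootAttributes) are hypothetical. Note that, as the question reports, the reader throws while reading an element with duplicate attributes, so in this sketch the attribute loop only runs for input the reader accepts.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class AttributeWalk
{
    // Returns "name=value" for each attribute on the first element of the input.
    public static List<string> ReadRootAttributes(string xml)
    {
        var result = new List<string>();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            // Skip the declaration, comments and whitespace up to the first element.
            reader.MoveToContent();

            if (reader.MoveToFirstAttribute())
            {
                do
                {
                    result.Add(reader.Name + "=" + reader.Value);
                } while (reader.MoveToNextAttribute());

                // Return the reader to the element that owns the attributes.
                reader.MoveToElement();
            }
        }
        return result;
    }

    // Same walk, but catches the XmlException instead of letting it propagate.
    public static bool TryReadRootAttributes(
        string xml, out List<string> attributes, out string error)
    {
        try
        {
            attributes = ReadRootAttributes(xml);
            error = null;
            return true;
        }
        catch (XmlException ex)
        {
            attributes = new List<string>();
            error = ex.Message;
            return false;
        }
    }
}
```

Calling TryReadRootAttributes with the sample from the question is expected to return false with the parser's message in error, while the same element without the second dupattr yields its two attributes.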
