How to remove duplicate attributes from XML with C#
I am parsing some XML files from a third-party provider, and unfortunately they are not always well-formed XML: sometimes an element contains duplicate attributes.
I don't have control over the source, and I don't know in advance which elements may have duplicate attributes or what the duplicate attribute names will be.
Obviously, loading the content into an XmlDocument object raises an XmlException on the duplicate attributes, so I thought I could use an XmlReader to step through the XML element by element and deal with the duplicate attributes when I get to the offending element.
However, the XmlException is raised on reader.Read(), before I get a chance to inspect the element's attributes.
Here's a sample method to demonstrate the issue:
// Requires: using System.IO; using System.Text; using System.Xml;
public static void ParseTest()
{
    const string xmlString =
@"<?xml version='1.0'?>
<!-- This is a sample XML document -->
<Items dupattr=""10"" id=""20"" dupattr=""33"">
<Item>test with a child element <more/> stuff</Item>
</Items>";
    var output = new StringBuilder();
    using (XmlReader reader = XmlReader.Create(new StringReader(xmlString)))
    {
        XmlWriterSettings ws = new XmlWriterSettings();
        ws.Indent = true;
        using (XmlWriter writer = XmlWriter.Create(output, ws))
        {
            while (reader.Read()) /* Exception thrown here when the Items element is encountered */
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        writer.WriteStartElement(reader.Name);
                        if (reader.HasAttributes) { /* CopyNonDuplicateAttributes(); */ }
                        // Empty elements such as <more/> produce no EndElement node, so close them here.
                        if (reader.IsEmptyElement) writer.WriteEndElement();
                        break;
                    case XmlNodeType.Text:
                        writer.WriteString(reader.Value);
                        break;
                    case XmlNodeType.XmlDeclaration:
                    case XmlNodeType.ProcessingInstruction:
                        writer.WriteProcessingInstruction(reader.Name, reader.Value);
                        break;
                    case XmlNodeType.Comment:
                        writer.WriteComment(reader.Value);
                        break;
                    case XmlNodeType.EndElement:
                        writer.WriteFullEndElement();
                        break;
                }
            }
        }
    }
    string str = output.ToString();
}
Is there another way to parse the input and remove the duplicate attributes without having to use regular expressions and string manipulation?
Comments (2)
I found a solution by treating the XML as an HTML document. Using the open-source Html Agility Pack library, I was then able to get valid XML.
The trick was to save the XML with an HTML header first.
So replace the XML declaration
<?xml version="1.0" encoding="utf-8" ?>
with an HTML declaration like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
Once the contents are saved to file, this method will return a valid XML Document.
The duplicate attribute nodes are automatically removed, with the later attribute values overwriting the earlier ones.
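As a rough illustration of the Html Agility Pack idea, here is a minimal sketch. It is not the answer's original method: the class and method names are hypothetical, it needs the external HtmlAgilityPack NuGet package, and it assumes the text can be loaded straight from a string instead of being saved to a file with the DOCTYPE header first. Whether that header swap is still required may depend on the library version, so treat this as a starting point to test against your own data.

```csharp
using System.IO;
using HtmlAgilityPack; // external dependency: NuGet package "HtmlAgilityPack"

public static class DuplicateAttributeCleaner
{
    // Hypothetical helper (not from the answer): parses the input with the
    // lenient HTML parser and writes it back out as XML.
    public static string ToWellFormedXml(string rawXml)
    {
        var doc = new HtmlDocument();

        // Ask the library to serialize the parsed tree as XML.
        doc.OptionOutputAsXml = true;

        // The HTML parser tolerates duplicate attributes instead of throwing.
        doc.LoadHtml(rawXml);

        using (var writer = new StringWriter())
        {
            doc.Save(writer);
            return writer.ToString();
        }
    }
}
```

The returned string can then be passed to XmlDocument.LoadXml or XmlReader as usual. Because the input is parsed as HTML and not as XML, details such as element-name casing may not round-trip exactly, so compare the output with the source before relying on it.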
OK, I think you need to catch the error:
Then you should be able to use the following methods:
and
to get the following properties:
This will enable you to get all the attribute values.
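The catch step can be sketched as follows. This covers only the error-catching part and is an assumption about what the answer has in mind: the class and method names are hypothetical, and the only library members used are XmlReader.Create, XmlReader.Read, and the LineNumber, LinePosition, and Message properties of XmlException.

```csharp
using System.IO;
using System.Xml;

public static class DuplicateAttributeProbe
{
    // Hypothetical helper: reads the whole document and returns the first
    // well-formedness error, or null if the input parses cleanly.
    public static XmlException FindFirstError(string xml)
    {
        try
        {
            using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
            {
                // Read() throws as soon as it meets the offending start tag.
                while (reader.Read()) { }
            }
            return null;
        }
        catch (XmlException ex)
        {
            // ex.LineNumber and ex.LinePosition locate the problem in the
            // source text; ex.Message describes it.
            return ex;
        }
    }
}
```

Calling DuplicateAttributeProbe.FindFirstError(xmlString) on the sample from the question should return an exception whose LineNumber and LinePosition point into the Items start tag. Note that this only locates the problem: the reader stops at the first error, so removing the duplicate would still mean editing the source text at that position and parsing again.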