How do I parse XML that contains invalid characters in node names?

Posted on 2024-07-26 11:01:25

So I'm trying to parse some XML, the creation of which is not under my control. The trouble is, they've somehow got nodes that look like this:

<ID_INTERNAL_FEAT_FOCUSED_EXPERTISE_(MORNINGSTAR) />
<ID_INTERNAL_FEAT_FOCUSED_EXPERTISE_(QUARTERSTAFF) />
<ID_INTERNAL_FEAT_FOCUSED_EXPERTISE_(SCYTHE) />
<ID_INTERNAL_FEAT_FOCUSED_EXPERTISE_(TRATNYR) />
<ID_INTERNAL_FEAT_FOCUSED_EXPERTISE_(TRIPLE-HEADED_FLAIL) />
<ID_INTERNAL_FEAT_FOCUSED_EXPERTISE_(WARAXE) />

Visual Studio and .NET both feel that the '(' and ')' characters, as used above, are totally invalid. Unfortunately, I need to process these files! Is there any way to get the Xml Reader classes to not freak out at seeing these characters, or dynamically escape them or something? I could do some sort of pre-processing on the whole file, but I DO want the '(' and ')' characters if they appear inside the node in some valid way, so I don't want to just remove them all...

Comments (2)

何止钟意 2024-08-02 11:01:25

That simply isn't valid. Pre-processing is your best bet, perhaps with a regex - something like:

string output = Regex.Replace(input, @"(<\w+)\((\w+)\)([ >/])", "$1$2$3");

Edit: a bit more complex to replace the "-" inside the brackets:

string output = Regex.Replace(input, @"(<\w+)\(([-\w]+)\)([ >/])",
    delegate(Match match) {
        return match.Groups[1].Value + match.Groups[2].Value.Replace('-', '_')
             + match.Groups[3].Value;
    });
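
As a rough sketch of how that pre-processing step might sit in front of the parser, assuming the whole file fits in memory as a string: the class name NodeNameSanitizer and the sample input are made up for illustration, while the pattern is the one from the edit above.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical helper; the name is illustrative and not part of any library.
public static class NodeNameSanitizer
{
    // Matches "<NAME(SUFFIX)" followed by a space, '>' or '/'.
    // Only the start of a tag is matched, so parentheses in text content are left alone.
    private static readonly Regex ParenthesizedName =
        new Regex(@"(<\w+)\(([-\w]+)\)([ >/])");

    public static string Sanitize(string input)
    {
        return ParenthesizedName.Replace(input, match =>
            match.Groups[1].Value
            + match.Groups[2].Value.Replace('-', '_')
            + match.Groups[3].Value);
    }

    public static void Main()
    {
        // Made-up sample: one bad element name plus text that legitimately contains parentheses.
        string raw = "<root><ID_FEAT_(TRIPLE-HEADED_FLAIL) /><note>keep (this)</note></root>";

        string cleaned = Sanitize(raw);

        // Parsing succeeds only if the cleaned text is well-formed XML.
        XDocument doc = XDocument.Parse(cleaned);
        Console.WriteLine(doc.Root.Name);
        Console.WriteLine(cleaned);
    }
}
```

Two caveats with this sketch: the pattern starts at "<" followed by word characters, so a closing tag such as "</NAME(X)>" would not be rewritten (the sample in the question only uses self-closing elements), and "-" is already legal inside an XML name, so replacing it with "_" is a naming choice rather than a requirement.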

鱼忆七猫命九 2024-08-02 11:01:25

If it isn't syntactically valid, it's not XML.

XML is very strict about this.

If you can't get the sending application to send correct XML, then just let them know that whatever downstream process sees this will fail, whether it's yours or some other app in the future.

If preprocessing isn't an option, another clever mechanism is to wrap the Stream object that is passed to the parser with a custom stream. That stream could look for < characters, and when it sees one, set a flag. Until a > character is seen, it could eat any ( or ) characters. We've used something like this to get rid of NUL and ^Z characters added to an XML file by a legacy transport mechanism. (The only gotcha there might be > characters inside of an attribute value, since they don't have to be escaped there - only < characters do.)
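
A minimal sketch of that wrapping idea, assuming it is acceptable to wrap a TextReader rather than a raw Stream so that the filter works on decoded characters instead of bytes. ParenStrippingReader and the sample input are hypothetical names for illustration; this is not the code the answer refers to.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical wrapper: drops '(' and ')' while the reader is between '<' and '>'.
public sealed class ParenStrippingReader : TextReader
{
    private readonly TextReader inner;
    private bool insideTag;

    public ParenStrippingReader(TextReader inner)
    {
        this.inner = inner;
    }

    // The base TextReader.Read(char[], int, int) is built on this method,
    // so overriding the single-character Read() is enough for this sketch.
    // Peek() is not overridden and keeps the base behavior of returning -1.
    public override int Read()
    {
        while (true)
        {
            int c = inner.Read();
            if (c == '<') insideTag = true;
            else if (c == '>') insideTag = false;
            else if (insideTag && (c == '(' || c == ')')) continue; // swallow it
            return c; // also passes through -1 at end of input
        }
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing) inner.Dispose();
        base.Dispose(disposing);
    }
}

public static class Demo
{
    public static void Main()
    {
        // Made-up sample input with one invalid element name.
        string raw = "<root><ID_FEAT_(SCYTHE) /><note>keep (this)</note></root>";

        using (var filtered = new ParenStrippingReader(new StringReader(raw)))
        using (var reader = XmlReader.Create(filtered))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element)
                    Console.WriteLine(reader.Name); // element names as the parser sees them
            }
        }
    }
}
```

The flag is deliberately naive. It also strips parentheses from attribute values, a literal > inside an attribute value clears the flag early, and comments or CDATA sections containing < would set it unexpectedly, so a real implementation would probably need to track quotes and those sections as well.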
