Need an application to fix XML with unescaped characters

Posted on 2024-10-18 10:55:16

This XML (it has an .rdf file extension, but it is XML) was generated by an automatic tool, but unfortunately it has various "unescaped" strings like

<tag xml:lang="fr">L'insuline (du latin insula, île) </tag>

and the parser (and the reasoner software) crashes on this...

Java or PHP solutions are fine with me too!

Thanks,
Celso

Comments (3)

三月梨花 2024-10-25 10:55:16

Here's a general method that I use a lot to make sure a String is escaped properly for XML.

// The five predefined XML entities.
private static final String AMP = "&amp;";
private static final String LT = "&lt;";
private static final String GT = "&gt;";
private static final String QUOTE = "&quot;";
private static final String APOS = "&apos;";

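// Escapes the five XML special characters and writes every non-ASCII
// character as a numeric character reference.
public static String encodeEntities(String dirtyString) {

    StringBuilder buff = new StringBuilder();
    char[] chars = dirtyString.toCharArray();

    for (int i = 0; i < chars.length; i++) {
        if (chars[i] > 0x7f) {
            // Read the full code point so a surrogate pair yields one reference.
            int cp = Character.codePointAt(chars, i);
            buff.append("&#").append(cp).append(';');
            i += Character.charCount(cp) - 1;
            continue;
        }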

        switch (chars[i]) {
        case '&':
            buff.append(AMP);
            break;
        case '<':
            buff.append(LT);
            break;
        case '\'':
            buff.append(APOS);
            break;
        case '"':
            buff.append(QUOTE);
            break;
        case '>':
            buff.append(GT);
            break;
        default:
            buff.append(chars[i]);
            break;
        }
    }

    return buff.toString();
}
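As an alternative to escaping by hand, the escaping can be left to the XML serializer. The following is only a sketch under assumptions: the class name `EscapeWithSerializer`, the helper `toXml`, and the sample text are made up for illustration, and the element name `tag` is borrowed from the question. It builds a DOM element from raw text and serializes it with the JDK's standard `Transformer`.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class EscapeWithSerializer {

    // Hypothetical helper: wraps raw text in a <tag> element and lets the
    // serializer decide which characters need escaping.
    static String toXml(String rawText) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element tag = doc.createElement("tag");
        tag.setTextContent(rawText);
        doc.appendChild(tag);

        Transformer t = TransformerFactory.newInstance().newTransformer();
        // Leave out the <?xml ...?> header so only the element is returned.
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Sample input containing characters that must be escaped in XML text.
        System.out.println(toXml("insulin & glucose < 5"));
    }
}
```

The point of this design is that only `&` and `<` strictly need escaping in element text, and the serializer already knows that, so there is less room for over- or under-escaping than with a hand-written table.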
早茶月光 2024-10-25 10:55:16

The XML given by the OP is well-formed: the single quote character is valid and so is the "i" with a circumflex, and neither needs escaping. I would make sure you're using a text encoding such as UTF-8. Here's a quick Java example that does an identity transformation:

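import java.io.StringReader;
import javax.xml.transform.*;
import javax.xml.transform.stream.*;

// Identity transform: parses the input and re-serializes it to stdout.
public static void main(String[] args) throws Exception {
    Transformer t = TransformerFactory.newInstance().newTransformer();
    StreamResult s = new StreamResult(System.out);
    t.transform(new StreamSource(new StringReader("<tag xml:lang=\"fr\">L'insuline (du latin insula, île) </tag>")), s);
}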
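To check the well-formedness claim directly, one option is to hand the string to a parser and see whether it complains. This is a minimal sketch, not a full validator: the class name `WellFormedCheck` and the helper `isWellFormed` are hypothetical, and the second input is a made-up counterexample with a bare `&`.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class WellFormedCheck {

    // Hypothetical helper: true if the parser accepts the string as XML.
    static boolean isWellFormed(String xml) throws Exception {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // The parser reports well-formedness errors as SAXException.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // The fragment from the question (\u00ee is the i with circumflex).
        String fromQuestion =
                "<tag xml:lang=\"fr\">L'insuline (du latin insula, \u00eele) </tag>";
        // A made-up counterexample: a bare ampersand is not allowed.
        String bareAmpersand = "<tag>insulin & glucose</tag>";

        System.out.println(isWellFormed(fromQuestion));
        System.out.println(isWellFormed(bareAmpersand));
    }
}
```

Because the input here is a Java `String`, the byte encoding of the file plays no role, which is why this check passes even though the original file may still fail to parse.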
止于盛夏 2024-10-25 10:55:16

The XML fragment given by the OP looks well-formed. Neither the apostrophe nor the i-circumflex needs escaping. The most likely problem is that the XML is encoded in iso-8859-1 but lacks an XML declaration, so the parser assumes it is UTF-8. The solution then is to add the XML declaration <?xml version="1.0" encoding="iso-8859-1"?>, which tells the parser how to decode the characters. (For a document containing only ASCII characters, iso-8859-1 and utf-8 are indistinguishable, so this problem only surfaces when you use characters outside the ASCII range.)
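If that diagnosis is right, the fix can be sketched in Java roughly as follows. This assumes the file really is iso-8859-1 and has no declaration yet; the class name `DeclareEncoding` and the helpers `withDeclaration` and `parseText` are hypothetical, and the bytes are built in memory here rather than read from the OP's file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class DeclareEncoding {

    // Hypothetical helper: prepends an XML declaration naming the encoding.
    // Assumes the input does not already start with a declaration.
    static byte[] withDeclaration(byte[] xmlBytes, String encoding) throws IOException {
        String decl = "<?xml version=\"1.0\" encoding=\"" + encoding + "\"?>\n";
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(decl.getBytes(StandardCharsets.US_ASCII));
        out.write(xmlBytes);
        return out.toByteArray();
    }

    // Returns the root element's text, or null if the bytes cannot be parsed.
    static String parseText(byte[] xmlBytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            // Covers both well-formedness errors and byte-decoding errors.
            return null;
        }
    }

    public static void main(String[] args) throws Exception {
        // The question's fragment, encoded as iso-8859-1 (\u00ee becomes byte 0xEE).
        byte[] latin1 = "<tag xml:lang=\"fr\">L'insuline (du latin insula, \u00eele) </tag>"
                .getBytes(StandardCharsets.ISO_8859_1);

        // Without a declaration the parser assumes UTF-8 and rejects byte 0xEE.
        System.out.println(parseText(latin1));
        // With the declaration the same bytes decode correctly.
        System.out.println(parseText(withDeclaration(latin1, "iso-8859-1")));
    }
}
```

For a real file, the same idea would apply to the bytes read from disk. Whether iso-8859-1 is the right name to declare depends on how the generating tool actually wrote the file, which the question does not say.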

A word of advice: if you had given the error message generated by the parser, you wouldn't have got so many incorrect answers.
