HTML XML Java xml-parsing

Parsing malformed XML documents (such as HTML files)

Posted on 2024-10-07 08:03:26

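After parsing, I want to strip out dangerous code and write the document back out in well-formed form.

The goal is to keep scripts from getting in through email input while still letting a lot of bad HTML work (or at least not fail completely).

Is there a library for this? Is there a better way to keep scripts away from the browser?

It is important that the program never throws a parse exception. The program may make a best guess; even if the guess is wrong, that is acceptable.

Edit: I would appreciate any comments on which parsers you think are better, and why.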


Comments (4)

总以为 2024-10-14 08:03:26

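For lenient parsing you may want to look at JSoup. But whitelisting is the way to solve this problem. If you only ban a set of "dangerous" elements, someone may find a way to sneak something past your parser. Instead, you should allow only a small set of safe elements.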
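To make the allowlist idea concrete, here is a minimal, library-free sketch. It is not JSoup's API, and the class name `AllowlistSketch` and the particular set of allowed elements are assumptions chosen for illustration. It uses a regular expression to spot anything that looks like a tag, which is far cruder than a real HTML parser, so treat it as a demonstration of the principle rather than something to deploy. Every tag not on the list is dropped, every attribute is dropped, and all remaining text is escaped, so unknown input degrades to harmless text and nothing is thrown on malformed markup.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name; not part of JSoup or any other library.
class AllowlistSketch {
    // Only these element names are ever written back out (an assumed list).
    private static final Set<String> ALLOWED =
            Set.of("b", "i", "em", "strong", "p", "br");

    // Anything that looks like a start or end tag, attributes included.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");

    static String sanitize(String html) {
        StringBuilder out = new StringBuilder();
        Matcher m = TAG.matcher(html);
        int pos = 0;
        while (m.find()) {
            // Text between tags is escaped, never passed through raw.
            out.append(escape(html.substring(pos, m.start())));
            String name = m.group(2).toLowerCase(Locale.ROOT);
            if (ALLOWED.contains(name)) {
                // Re-emit a bare tag: every attribute is dropped.
                out.append('<').append(m.group(1)).append(name).append('>');
            }
            // Tags that are not on the list are silently discarded.
            pos = m.end();
        }
        out.append(escape(html.substring(pos)));
        return out.toString();
    }

    private static String escape(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;")
                   .replace("\"", "&quot;");
    }
}
```

Note that the body of a removed `script` element survives as escaped text, and that existing entities such as `&amp;` get escaped a second time; a real parser-based sanitizer would handle both more gracefully.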

定格我的天空 2024-10-14 08:03:26

Use one of the available tools that convert HTML to XHTML.

For example:

http://www.chilkatsoft.com/java-html.asp

http://java-source.net/open-source/html-parsers

http://htmlcleaner.sourceforge.net/

and so on.

Then use a regular XML parser.
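The second half of that pipeline could look roughly like the sketch below. It assumes the conversion step has already produced well-formed XHTML (the converter itself is not shown), and it uses only the JDK's built-in DOM parser and serializer. The class name `XhtmlFilterSketch` and the list of allowed elements are assumptions made for this example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class name; the input is assumed to be XHTML already.
class XhtmlFilterSketch {
    private static final Set<String> ALLOWED =
            Set.of("div", "p", "b", "i", "em", "strong", "br");

    static String filter(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Refuse DOCTYPE declarations to avoid entity-expansion tricks;
            // relax this if your converter emits a DOCTYPE.
            factory.setFeature(
                    "http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            // The root element itself is kept as-is; only descendants are filtered.
            prune(doc.getDocumentElement());

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter writer = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(writer));
            return writer.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not filter XHTML", e);
        }
    }

    // Removes child elements that are not allowed, strips all attributes
    // from those that are, and recurses. Text nodes are left untouched.
    private static void prune(Element parent) {
        NodeList children = parent.getChildNodes();
        // Walk backwards because the node list is live and shrinks on removal.
        for (int i = children.getLength() - 1; i >= 0; i--) {
            Node child = children.item(i);
            if (!(child instanceof Element)) {
                continue;
            }
            Element el = (Element) child;
            if (!ALLOWED.contains(el.getTagName())) {
                parent.removeChild(el);
            } else {
                while (el.getAttributes().getLength() > 0) {
                    el.removeAttributeNode((Attr) el.getAttributes().item(0));
                }
                prune(el);
            }
        }
    }
}
```

Unlike a lenient HTML parser, the XML parser here fails on malformed input, which is why the conversion to XHTML has to happen first.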

初见终念 2024-10-14 08:03:26

I use the Jericho HTML parser for this purpose.

A somewhat tweaked version of their sanitizer example:

// Imports assume Jericho HTML Parser 3.x package names and Guava's Sets helper;
// adjust them to the versions you actually use.
import static net.htmlparser.jericho.CharacterReference.decode;
import static net.htmlparser.jericho.CharacterReference.encode;
import static net.htmlparser.jericho.HTMLElementName.*;

import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.util.List;
import java.util.Set;

import com.google.common.collect.Sets;

import net.htmlparser.jericho.*;

public class HtmlSanitizer {

    private HtmlSanitizer() {
    }

    // Whitelist of element names that may appear in the output.
    private static final Set<String> VALID_ELEMENTS = Sets.newHashSet(DIV, BR,
            P, B, I, OL, UL, LI, A, STRONG, SPAN, EM, TT, IMG);

    // Whitelist of attribute names that may appear in the output.
    private static final Set<String> VALID_ATTRIBUTES = Sets.newHashSet("id",
            "class", "href", "target", "title", "src");

    // Marks tags that have already been accepted.
    private static final Object VALID_MARKER = new Object();

    public static void sanitize(Reader r, Writer w) {
        try {
            sanitize(new Source(r)).writeTo(w);
            w.flush();
            r.close();
        } catch (IOException ioe) {
            throw new RuntimeException("error during sanitize", ioe);
        }
    }

    public static OutputDocument sanitize(Source source) {
        source.fullSequentialParse();
        OutputDocument doc = new OutputDocument(source);
        List<Tag> tags = source.getAllTags();
        int pos = 0;
        for (Tag tag : tags) {
            if (processTag(tag, doc))
                tag.setUserData(VALID_MARKER);
            else
                doc.remove(tag);
            // Re-encode the text between the previous tag and this one.
            reencodeTextSegment(source, doc, pos, tag.getBegin());
            pos = tag.getEnd();
        }
        reencodeTextSegment(source, doc, pos, source.getEnd());
        return doc;
    }

    // Returns true if the tag is kept (and rewritten), false if it must be removed.
    private static boolean processTag(Tag tag, OutputDocument doc) {
        String elementName = tag.getName();
        if (!VALID_ELEMENTS.contains(elementName))
            return false;
        if (tag.getTagType() == StartTagType.NORMAL) {
            Element element = tag.getElement();
            if (HTMLElements.getEndTagRequiredElementNames().contains(
                    elementName)) {
                if (element.getEndTag() == null)
                    return false;
            } else if (HTMLElements.getEndTagOptionalElementNames().contains(
                    elementName)) {
                if (HTMLElementName.LI.equals(elementName) && !isValidLITag(tag))
                    return false;
                // Supply the end tag that the source left out.
                if (element.getEndTag() == null)
                    doc.insert(element.getEnd(), getEndTagHTML(elementName));
            }
            doc.replace(tag, getStartTagHTML(element.getStartTag()));
        } else if (tag.getTagType() == EndTagType.NORMAL) {
            if (tag.getElement() == null)
                return false;
            if (HTMLElementName.LI.equals(elementName) && !isValidLITag(tag))
                return false;
            doc.replace(tag, getEndTagHTML(elementName));
        } else {
            // Comments, server tags, doctypes and the like are dropped.
            return false;
        }
        return true;
    }

    // An LI is valid only inside a UL or OL that was itself accepted.
    private static boolean isValidLITag(Tag tag) {
        Element parentElement = tag.getElement().getParentElement();
        if (parentElement == null
                || parentElement.getStartTag().getUserData() != VALID_MARKER)
            return false;
        return HTMLElementName.UL.equals(parentElement.getName())
                || HTMLElementName.OL.equals(parentElement.getName());
    }

    private static void reencodeTextSegment(Source source, OutputDocument doc,
            int begin, int end) {
        if (begin >= end)
            return;
        Segment textSegment = new Segment(source, begin, end);
        String encodedText = encode(decode(textSegment));
        doc.replace(textSegment, encodedText);
    }

    // Rebuilds a start tag that contains only whitelisted attributes.
    private static CharSequence getStartTagHTML(StartTag startTag) {
        StringBuilder sb = new StringBuilder();
        sb.append('<').append(startTag.getName());
        for (Attribute attribute : startTag.getAttributes()) {
            if (VALID_ATTRIBUTES.contains(attribute.getKey())) {
                sb.append(' ').append(attribute.getName());
                if (attribute.getValue() != null) {
                    sb.append("=\"");
                    sb.append(CharacterReference.encode(attribute.getValue()));
                    sb.append('"');
                }
            }
        }
        // Self-close elements that have no end tag and never take one.
        if (startTag.getElement().getEndTag() == null
                && !HTMLElements.getEndTagOptionalElementNames().contains(
                        startTag.getName()))
            sb.append('/');
        sb.append('>');
        return sb;
    }

    private static String getEndTagHTML(String tagName) {
        return "</" + tagName + '>';
    }

}


凹づ凸ル 2024-10-14 08:03:26

Take a look at http://nekohtml.sourceforge.net/. It has built-in tag balancing.
Also check out the custom filters section of NekoHTML: http://nekohtml.sourceforge.net/filters.html#filters.removing. It is a very good HTML parser.
