Parser for checking whether HTML is well formed

Posted 2024-10-19 19:29:40


Hey guys, I need to determine whether a given HTML document is well formed.
I just need a simple implementation that uses only Java core API classes, i.e. no third-party libraries such as JTidy.

What I actually need is an algorithm that scans a list of tags. If it finds an open tag and the next tag isn't its corresponding close tag, the next tag must be another open tag, which in turn must either be closed right away or be followed by yet another open tag, and so on. Once the innermost tag is closed, the close tags of the previously opened tags must appear in the reverse of the order in which they were opened. If the list conforms to this order the method returns true, otherwise false. I've already written a method that converts a tag to its close tag.

Here is the skeleton code I've started working on. It's not too neat, but it should give you a basic idea of what I'm trying to do.

public boolean validateHtml() {

    ArrayList<String> tags = fetchTags();
    // fetchTags returns e.g. [<html>, <head>, <title>, </title>, </head>, <body>, <h1>, </h1>, </body>, </html>]

    // Second list for tags whose corresponding close tag has not been found yet
    ArrayList<String> unclosedTags = new ArrayList<String>();

    String temp;

    // Stop one short of the end, because the body looks at tags.get(i + 1)
    for (int i = 0; i < tags.size() - 1; i++) {

        temp = tags.get(i);

        if (!tags.get(i + 1).equals(TagOperations.convertToCloseTag(temp))) {
            unclosedTags.add(temp);
            // TODO (unfinished): check the later close tags against unclosedTags in reverse order

        } else {
            return true; // well formed html
        }
    }

    return true;
}
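
For reference, the ordering rule described in the question can be sketched with a single stack over the tag list. This is only a minimal sketch under a few assumptions: the tags carry no attributes, every open tag needs a close tag (so no `<br>` or self-closing tags), and `toCloseTag` is a hypothetical stand-in for the asker's `TagOperations.convertToCloseTag`, whose real behavior is not shown in the post.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

class TagListCheck {

    // Hypothetical stand-in for TagOperations.convertToCloseTag: "<h1>" -> "</h1>".
    static String toCloseTag(String openTag) {
        return "</" + openTag.substring(1);
    }

    static boolean isWellFormed(List<String> tags) {
        // Open tags still waiting for their close tag, most recent on top.
        Deque<String> unclosed = new ArrayDeque<String>();
        for (String tag : tags) {
            if (!tag.startsWith("</")) {
                unclosed.push(tag);                 // open tag: remember it
            } else if (unclosed.isEmpty()
                    || !tag.equals(toCloseTag(unclosed.pop()))) {
                return false;                       // stray or mismatched close tag
            }
        }
        return unclosed.isEmpty();                  // nothing may be left open
    }

    public static void main(String[] args) {
        List<String> good = Arrays.asList(
                "<html>", "<head>", "<title>", "</title>", "</head>",
                "<body>", "<h1>", "</h1>", "</body>", "</html>");
        List<String> bad = Arrays.asList("<html>", "<body>", "</html>", "</body>");
        System.out.println(isWellFormed(good));     // prints true
        System.out.println(isWellFormed(bad));      // prints false
    }
}
```

The final `unclosed.isEmpty()` check matters: without it, a list that ends with tags still open would be reported as well formed.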


Comments (1)

oО清风挽发oО 2024-10-26 19:29:40

Two thoughts. First off, maybe you could get away with using an XML parser on the HTML? Potentially easier and vastly less time consuming.
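
A minimal sketch of the XML parser idea, using only the JAXP/SAX classes that ship with the JDK. The class and method names here are made up for illustration. Note the big assumption: this treats the input as XML, so ordinary HTML that leaves void elements such as `<br>` unclosed, uses unquoted attributes, or uses entities like `&nbsp;` will be rejected even though browsers accept it.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class XmlWellFormedCheck {

    // Returns true if the markup parses as well-formed XML, false if the parser rejects it.
    static boolean isWellFormedXml(String markup) {
        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    // DefaultHandler ignores content events; we only care whether parsing succeeds.
                    .parse(new InputSource(new StringReader(markup)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false;                           // not well formed
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);     // environment problem, not a markup problem
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormedXml("<html><body><h1>Hi</h1></body></html>")); // prints true
        System.out.println(isWellFormedXml("<html><body></html></body>"));            // prints false
        System.out.println(isWellFormedXml("<p>line<br></p>"));                       // prints false
    }
}
```

If the documents are XHTML-style with everything explicitly closed, this is probably the least code. For loose real-world HTML it is likely too strict, which is where a hand-written tag check comes in.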

I haven't put a whole lot of thought into this, but to me it sounds like recursion and a stack would be the way to go. Something like

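import java.util.ArrayDeque;
import java.util.Deque;

public class MyClass
{
    private final Deque<String> openedTags = new ArrayDeque<String>();
    private final String htmlInput;
    private int pos;                                //where the scan left off (see edit 2)

    public MyClass(String htmlInput)
    {
        this.htmlInput = htmlInput;
    }
    public boolean validate()
    {
        openedTags.clear();
        pos = 0;
        return validateLevel();
    }
    //Validates one nesting level. Returns true after consuming the close tag of the
    //element opened by the caller, or at end of input with nothing left open.
    private boolean validateLevel()
    {
        String curTag;
        while((curTag = nextTag()) != null)         //worker loop
        {
            if(isOneOffTag(curTag))                 //matches <tags />
                continue;
            else if(isOpenTag(curTag))              //matches <tags>
            {
                openedTags.push(curTag);
                if(!validateLevel())
                    return false;
            }
            else if(isCloseTag(curTag))             //matches </tags>
            {
                if(openedTags.isEmpty() || !tagIsSimilar(curTag, openedTags.peek()))
                    return false;
                openedTags.pop();
                return true;                        //hand control back to the enclosing level
            }
            //anything else (comments, doctype) is skipped
        }
        return openedTags.isEmpty();
    }
    //Returns the next "<...>" chunk and advances pos, or null when no complete tag is left.
    private String nextTag()
    {
        int start = htmlInput.indexOf('<', pos);
        int end = start < 0 ? -1 : htmlInput.indexOf('>', start);
        if(end < 0)
            return null;
        pos = end + 1;
        return htmlInput.substring(start, end + 1);
    }
    private boolean isOpenTag(String tag){ return tag.length() > 2 && Character.isLetter(tag.charAt(1)) && !isOneOffTag(tag); }
    private boolean isCloseTag(String tag){ return tag.startsWith("</"); }
    private boolean isOneOffTag(String tag){ return tag.endsWith("/>"); }
    private boolean tagIsSimilar(String curTag, String lastTag){ return tagName(curTag).equalsIgnoreCase(tagName(lastTag)); }
    //Tag name without brackets, slash, or attributes, e.g. "<body class='x'>" gives "body".
    private String tagName(String tag)
    {
        String inner = tag.substring(tag.startsWith("</") ? 2 : 1, tag.length() - 1).trim();
        return inner.split("[\\s/]", 2)[0];
    }
}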

*edit 1: probably should have pushed onto the stack.

**edit 2: I suppose the issue here is knowing where you left off when all you return is a boolean. That would require some kind of pointer into the input so that the caller can pick up from there. I believe the idea would still work, though.
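
One way to handle the "where did I leave off" problem from edit 2 is to have the recursive method return a position instead of a boolean. The sketch below is only an illustration of that idea, not the code above: it works on an already extracted list of tags, assumes tags without attributes and without self-closing forms, and all names in it are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class RecursiveCheck {

    // Consumes zero or more complete elements starting at index pos.
    // Returns the index of the first tag it could not consume,
    // or -1 if an element turned out to be malformed.
    static int parseElements(List<String> tags, int pos) {
        while (pos < tags.size() && !tags.get(pos).startsWith("</")) {
            String open = tags.get(pos);
            int afterChildren = parseElements(tags, pos + 1);   // children of `open`
            if (afterChildren < 0 || afterChildren >= tags.size()) {
                return -1;                  // a child failed, or `open` was never closed
            }
            String expectedClose = "</" + open.substring(1);
            if (!tags.get(afterChildren).equals(expectedClose)) {
                return -1;                  // wrong close tag
            }
            pos = afterChildren + 1;        // resume right after the close tag
        }
        return pos;
    }

    static boolean isWellFormed(List<String> tags) {
        // Well formed only if the whole list was consumed.
        return parseElements(tags, 0) == tags.size();
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed(Arrays.asList("<a>", "<b>", "</b>", "</a>"))); // prints true
        System.out.println(isWellFormed(Arrays.asList("<a>", "<b>", "</a>", "</b>"))); // prints false
    }
}
```

Here the call stack plays the role of the explicit stack, and the returned index is the pointer: each level tells its caller exactly where to continue.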
