Parsing a large XML file with a SAX parser: the handler class is becoming bloated and unreadable. How can I fix this?

Posted 2024-09-16 09:16:11


This is purely a code readability related question, the performance of the class is not an issue.

Here is how I am building this XMLHandler :

For each element that is relevant to the application, I have a boolean 'inElementName' which I set to true or false depending on my location during parsing. The problem: I now have 10+ boolean declarations at the beginning of my class, and the list keeps growing.

In my startElement and endElement methods, I have hundreds of lines of

if ("elementName".equals(qName)) {
   ...
} else if ("anotherElementName".equals(qName)) {
   ...
}

with different parsing rules in them (if I am at this position in the XML file, do this, otherwise do that, etc.)

Coding new parsing rules and debugging is becoming increasingly painful.

What are the best practices for coding a SAX parser, and what can I do to make my code more readable?


何止钟意 2024-09-23 09:16:11


What do you use the boolean variables for? To keep track of nesting?

I recently implemented this by using an enum for every element.
The code is at work but this is a rough approximation of it off the top of my head:

enum Element {
   // special markers:
   ROOT,
   DONT_CARE,

   // Element               tag                  parents
   RootElement(             "root",              ROOT),
   AnElement(               "anelement"),     // DONT_CARE
   AnotherElement(          "anotherelement"),// DONT_CARE
   AChild(                  "child",             AnElement),
   AnotherChild(            "child",             AnotherElement);

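   final String tag;         // XML tag name; null for the special markers
   final Element[] parents;  // allowed parents; empty means DONT_CARE

   Element() { this.tag = null; this.parents = new Element[0]; }
   Element(String tag, Element ... parents) { this.tag = tag; this.parents = parents; }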
}

class MySaxParser extends DefaultHandler {
    Map<Pair<Element, String>, Element> elementMap = buildElementMap();
    LinkedList<Element> nestingStack = new LinkedList<Element>();

    public void startElement(String namespaceURI, String sName, String qName, Attributes attrs) {
        Element parent = nestingStack.isEmpty() ? Element.ROOT : nestingStack.getLast();
        Element element = elementMap.get(pair(parent, sName));
        if (element == null)
            element = elementMap.get(pair(Element.DONT_CARE, sName));
        if (element == null)
            throw new IllegalStateException("I did not expect <" + sName + "> in this context");

        nestingStack.addLast(element);

        switch (element) {
        case RootElement: ... // Probably don't need cases for many elements at start unless we have attributes
        case AnElement: ...
        case AnotherElement: ...
        case AChild: ...
        case AnotherChild: ...
        default: // Most cases here. Generally nothing to do on startElement
        }
    }
    public void endElement(String namespaceURI, String sName, String qName) {
        // Similar to startElement() but most switch cases do something with the data.
        Element element = nestingStack.removeLast();
        if (!element.tag.equals(sName)) throw new IllegalStateException("Unexpected </" + sName + ">");
        switch (element) {
           ...
        }
    }

    // Construct the structure map from the parent information.
    private Map<Pair<Element, String>, Element> buildElementMap() {
        Map<Pair<Element, String>, Element> result = new LinkedHashMap<Pair<Element, String>, Element>();
        for (Element element: Element.values()) {
            if (element.tag == null) continue;
            if (element.parents.length == 0)
                result.put(pair(Element.DONT_CARE, element.tag), element);
            else for (Element parent: element.parents) {
                result.put(pair(parent, element.tag), element);
            }
        }
        return result;
    }
    // Convenience method to avoid the need for using "new Pair()" with verbose Type parameters 
    private <A,B> Pair<A,B> pair(A a, B b) {
        return new Pair<A, B>(a, b);
    }
    // A simple Pair class, just for completeness.  Better to use an existing implementation.
    private static class Pair<A,B> {
        final A a;
        final B b;
        Pair(A a, B b){ this.a = a; this.b = b;}
        public boolean equals(Object o) {...};
        public int hashCode() {...};
    }
}

Edit:
The position within the XML structure is tracked by a stack of elements. When startElement is called, the appropriate Element enum is looked up in a Map whose two-part key is 1) the parent element taken from the top of the tracking stack and 2) the element tag passed as the sName parameter. The Map is generated from the parent information declared as part of the Element enum. The Pair class is simply a holder for the two-part key.

This approach allows the same element-tag that appears repeatedly in different parts of the XML structure with different semantics to be represented by different Element enums. For example:

<root>
  <anelement>
    <child>Data pertaining to child of anelement</child>
  </anelement>      
  <anotherelement>
    <child>Data pertaining to child of anotherelement</child>
  </anotherelement>
</root>

Using this technique, we don't need to use flags to track the context so that we know which <child> element is being processed. The context is declared as part of the Element enum definition and reduces confusion by eliminating assorted state variables.
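
As a rough, runnable illustration of the stack idea above, here is a simplified sketch. It is not the answerer's code: it tracks plain tag names on a stack instead of the Element enum, and the class name ContextStackDemo, the events list, and the parse helper are hypothetical names introduced only for this example. It uses qName because the default SAXParserFactory is not namespace-aware.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo handler: the stack replaces the per-element boolean flags.
class ContextStackDemo extends DefaultHandler {
    // Names of the currently open elements, innermost first.
    private final Deque<String> path = new ArrayDeque<>();
    // Character data collected for the element being read.
    private final StringBuilder text = new StringBuilder();
    // Collected results, one entry per <child> element.
    final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        path.push(qName);
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        path.pop();                   // remove the element that just closed
        String parent = path.peek();  // its parent, or null at the root
        if ("child".equals(qName)) {
            // The same tag is told apart by its parent, with no flags involved.
            events.add(parent + "/child=" + text.toString().trim());
        }
        text.setLength(0);
    }

    static List<String> parse(String xml) throws Exception {
        ContextStackDemo handler = new ContextStackDemo();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><anelement><child>A</child></anelement>"
                   + "<anotherelement><child>B</child></anotherelement></root>";
        System.out.println(parse(xml)); // [anelement/child=A, anotherelement/child=B]
    }
}
```

The enum-based version in the answer goes one step further by turning each (parent, tag) combination into its own constant, so the switch statements can dispatch on it directly.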


梓梦 2024-09-23 09:16:11


It depends on the XML structure. If the actions for different cases are easy or (more or less) "independent", you could try to use a map:

interface Command {
   public void assemble(Attributes attr, MyStructure myStructure);
}
...

Map<String, Command> commands= new HashMap<String, Command>();
...
if(commands.containsKey(qName)) {
   commands.get(qName).assemble(attr, myStructure);
} else {
   //unknown qName
}
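
To make the fragment above concrete, here is a minimal self-contained sketch of the same map-of-commands idea. Several parts are assumptions made for illustration: the class name CommandMapDemo, the use of a List<String> as a stand-in for MyStructure, and the "book" and "author" tags and their attributes are all hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo handler: one Command per tag instead of an if/else chain.
class CommandMapDemo extends DefaultHandler {
    // Same shape as the Command interface above; List<String> stands in for MyStructure.
    interface Command {
        void assemble(Attributes attr, List<String> structure);
    }

    private final Map<String, Command> commands = new HashMap<>();
    final List<String> structure = new ArrayList<>(); // results built by the commands
    final List<String> unknown = new ArrayList<>();   // tags with no registered command

    CommandMapDemo() {
        // Each parsing rule lives in its own small command.
        commands.put("book", (attr, s) -> s.add("book:" + attr.getValue("title")));
        commands.put("author", (attr, s) -> s.add("author:" + attr.getValue("name")));
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        Command command = commands.get(qName);
        if (command != null) {
            command.assemble(attrs, structure);
        } else {
            unknown.add(qName); // unknown qName
        }
    }

    static CommandMapDemo parse(String xml) throws Exception {
        CommandMapDemo handler = new CommandMapDemo();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        CommandMapDemo h = parse(
            "<library><book title=\"Dune\"/><author name=\"Herbert\"/></library>");
        System.out.println(h.structure); // [book:Dune, author:Herbert]
        System.out.println(h.unknown);   // [library]
    }
}
```

Adding a new parsing rule then means registering one more entry in the map, leaving startElement untouched.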
