Parsing dblp.xml with Java DOM/SAX

Published 2024-11-28 05:46:28


I am trying to parse dblp.xml in java to get the author names/title/year etc, but since the file is huge (860MB), I cannot use DOM/SAX on the complete file.

So I split the file into multiple small files of around 100MB each.

Each file now contains thousands of records like this one:

<dblp>
<inproceedings mdate="2011-06-23" key="conf/aime/BianchiD95">
<author>Nadia Bianchi</author>
<author>Claudia Diamantini</author>
<title>Integration of Neural Networks and Rule Based Systems in the Interpretation of Liver Biopsy Images.</title>
<pages>367-378</pages>
<year>1995</year>
<crossref>conf/aime/1995</crossref>
<booktitle>AIME</booktitle>
<url>db/conf/aime/aime1995.html#BianchiD95</url>
<ee>http://dx.doi.org/10.1007/3-540-60025-6_152</ee>
</inproceedings>
</dblp>

I am assuming 100MB should fit in a DOM, but the code stops after roughly 45k lines. Here is the Java code I am using:

public List<dblpModel> readConfigDOM(String configFile) {
    List<dblpModel> items = new ArrayList<dblpModel>();

    try {
        File fXmlFile = new File(configFile);
        DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
        Document doc = dBuilder.parse(fXmlFile);
        doc.getDocumentElement().normalize();

        // NOTE: the sample records are <inproceedings>, not <incollection>;
        // searching for the wrong tag name returns an empty NodeList.
        NodeList nList = doc.getElementsByTagName("inproceedings");

        for (int temp = 0; temp < nList.getLength(); temp++) {
            Node nNode = nList.item(temp);
            if (nNode.getNodeType() == Node.ELEMENT_NODE) {
                Element eElement = (Element) nNode;

                List<String> strList = getTagValueString("title", eElement);
                System.out.println(strList.get(0));

                strList = getTagValueString("author", eElement);
                System.out.println("Author : " + strList.size());
                for (String s : strList) {
                    System.out.println(s);
                }

                dblpModel item = new dblpModel();
                // TODO: populate item from the extracted fields (depends on dblpModel)
                items.add(item);
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return items;
}


private static List<String> getTagValueString(String sTag, Element eElement) {
    List<String> values = new ArrayList<String>();
    NodeList tags = eElement.getElementsByTagName(sTag);

    for (int i = 0; i < tags.getLength(); i++) {
        // getTextContent() concatenates all descendant text nodes, which
        // handles both plain and mixed-content elements in one call.
        values.add(tags.item(i).getTextContent());
    }
    return values;
}

Any help? I have also tried using the StAX API to parse the large document, but that also …
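For comparison, a streaming StAX parser keeps only one record in memory at a time, so in principle it can handle the full 860MB file without splitting it. The sketch below is an assumption-laden illustration, not the asker's code: the class name `DblpStax` and the record layout are taken from the sample record above, and the real dblp.xml additionally declares character entities in dblp.dtd, which must be resolvable for parsing to succeed.

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class DblpStax {

    /** One parsed publication as [title, year, author1, author2, ...]. */
    static List<String[]> parse(InputStream in) throws Exception {
        XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(in);

        List<String[]> records = new ArrayList<>();
        List<String> authors = new ArrayList<>();
        String title = "", year = "";
        StringBuilder text = new StringBuilder();

        while (r.hasNext()) {
            int event = r.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                text.setLength(0);                 // start collecting a fresh text run
            } else if (event == XMLStreamConstants.CHARACTERS) {
                text.append(r.getText());
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                String name = r.getLocalName();
                if (name.equals("author")) authors.add(text.toString());
                else if (name.equals("title")) title = text.toString();
                else if (name.equals("year")) year = text.toString();
                else if (name.equals("inproceedings")) {   // record boundary: emit and reset
                    List<String> rec = new ArrayList<>();
                    rec.add(title);
                    rec.add(year);
                    rec.addAll(authors);
                    records.add(rec.toArray(new String[0]));
                    authors.clear();
                    title = year = "";
                }
            }
        }
        r.close();
        return records;
    }

    public static void main(String[] args) throws Exception {
        String sample = "<dblp><inproceedings key=\"conf/aime/BianchiD95\">"
                + "<author>Nadia Bianchi</author><author>Claudia Diamantini</author>"
                + "<title>Integration of Neural Networks...</title><year>1995</year>"
                + "</inproceedings></dblp>";
        for (String[] rec : parse(new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8)))) {
            System.out.println(rec[1] + " | " + rec[0]);
        }
    }
}
```

Because the reader never builds a tree, memory usage stays flat regardless of file size; to process the real file, replace the `ByteArrayInputStream` with a `FileInputStream` over dblp.xml.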


Comments (2)

三五鸿雁 2024-12-05 05:46:28


If your goal is just to get the details out, then just use a BufferedReader to read the file as a text file. If you want, throw in some regex.

If using MySQL is an option, you may be able to get it to do the heavy lifting through its XML functions.

Hope this helps.
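A minimal sketch of that line-oriented approach, assuming each field sits on its own line as in the sample record above (the class name `GrepDblp` and the regex are illustrative, not from the answer):

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GrepDblp {
    // Matches a simple single-line element like <author>Nadia Bianchi</author>;
    // the backreference \1 ensures the closing tag matches the opening one.
    static final Pattern FIELD = Pattern.compile("<(author|title|year)>(.*?)</\\1>");

    public static void main(String[] args) throws Exception {
        // Stand-in for new BufferedReader(new FileReader("dblp.xml"))
        String sample = "<inproceedings>\n"
                + "<author>Nadia Bianchi</author>\n"
                + "<year>1995</year>\n"
                + "</inproceedings>\n";
        try (BufferedReader br = new BufferedReader(new StringReader(sample))) {
            String line;
            while ((line = br.readLine()) != null) {
                Matcher m = FIELD.matcher(line);
                while (m.find()) {
                    System.out.println(m.group(1) + " = " + m.group(2));
                }
            }
        }
    }
}
```

This silently misses any element that spans multiple lines or contains nested markup, which is the trade-off of treating XML as plain text.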

影子的影子 2024-12-05 05:46:28


Don't fuss too much about the XML format. It is not terribly useful anyway. Just read it as a text file and parse the lines as strings. You can then export the data to CSV and use it however you want from that point.
Unfortunately, XML is not very efficient for large documents. I did something similar here for a research project:
http://qualityofdata.com/2011/03/27/dblp-for-sql-server/
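A sketch of that text-to-CSV route, under the same one-element-per-line assumption as the sample record (the class name `DblpToCsv`, the column layout, and the `stripTag` helper are all illustrative choices, not from the linked project):

```java
import java.io.BufferedReader;
import java.io.PrintWriter;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;

public class DblpToCsv {

    /** "<year>1995</year>" -> "1995"; naive, relies on one element per line. */
    static String stripTag(String line) {
        return line.replaceAll("<[^>]+>", "").trim();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for reading dblp.xml line by line from disk
        String sample = "<dblp>\n<inproceedings key=\"conf/aime/BianchiD95\">\n"
                + "<author>Nadia Bianchi</author>\n<title>Some Title.</title>\n<year>1995</year>\n"
                + "</inproceedings>\n</dblp>\n";
        StringWriter out = new StringWriter();
        PrintWriter csv = new PrintWriter(out);
        csv.println("title,year,authors");

        List<String> authors = new ArrayList<>();
        String title = "", year = "";
        try (BufferedReader br = new BufferedReader(new StringReader(sample))) {
            String line;
            while ((line = br.readLine()) != null) {
                String t = line.trim();
                if (t.startsWith("<author>")) authors.add(stripTag(t));
                else if (t.startsWith("<title>")) title = stripTag(t);
                else if (t.startsWith("<year>")) year = stripTag(t);
                else if (t.startsWith("</inproceedings>")) {
                    // End of one record: flush a CSV row and reset state
                    csv.printf("\"%s\",%s,\"%s\"%n", title, year, String.join("; ", authors));
                    authors.clear();
                    title = year = "";
                }
            }
        }
        csv.flush();
        System.out.print(out);
    }
}
```

From the CSV, loading into a database or spreadsheet is straightforward; note the quoting here is minimal and would need escaping for titles that themselves contain quotes.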
