Parsing dblp.xml with Java DOM/SAX

Published 2024-11-28 05:46:28


I am trying to parse dblp.xml in java to get the author names/title/year etc, but since the file is huge (860MB), I cannot use DOM/SAX on the complete file.

So I split the file into multiple small files of around 100MB each.

Each file now contains thousands of records like this one:

<dblp>
<inproceedings mdate="2011-06-23" key="conf/aime/BianchiD95">
<author>Nadia Bianchi</author>
<author>Claudia Diamantini</author>
<title>Integration of Neural Networks and Rule Based Systems in the Interpretation of Liver Biopsy Images.</title>
<pages>367-378</pages>
<year>1995</year>
<crossref>conf/aime/1995</crossref>
<booktitle>AIME</booktitle>
<url>db/conf/aime/aime1995.html#BianchiD95</url>
<ee>http://dx.doi.org/10.1007/3-540-60025-6_152</ee>
</inproceedings>
</dblp>

I am assuming 100MB should fit in a DOM, but the code stops after roughly 45k lines. Here is the Java code I am using:

public List<dblpModel> readConfigDOM(String configFile) {
    List<dblpModel> items = new ArrayList<dblpModel>();

    try {
        File fXmlFile = new File(configFile);
        DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
        Document doc = dBuilder.parse(fXmlFile);
        doc.getDocumentElement().normalize();

        // NOTE: the sample records are <inproceedings>, not <incollection>;
        // searching for the wrong tag name returns an empty NodeList.
        NodeList nList = doc.getElementsByTagName("inproceedings");

        for (int temp = 0; temp < nList.getLength(); temp++) {
            Node nNode = nList.item(temp);
            if (nNode.getNodeType() == Node.ELEMENT_NODE) {
                Element eElement = (Element) nNode;

                List<String> strList = getTagValueString("title", eElement);
                System.out.println(strList.get(0));

                strList = getTagValueString("author", eElement);
                System.out.println("Author : " + strList.size());
                for (String s : strList) {
                    System.out.println(s);
                }

                dblpModel item = new dblpModel();
                // TODO: populate item from the extracted fields (depends on dblpModel)
                items.add(item);
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return items;
}


private static List<String> getTagValueString(String sTag, Element eElement) {
    List<String> values = new ArrayList<String>();
    NodeList tags = eElement.getElementsByTagName(sTag);

    for (int i = 0; i < tags.getLength(); i++) {
        // getTextContent() concatenates all descendant text nodes, which
        // handles both plain and mixed-content elements in one call.
        values.add(tags.item(i).getTextContent());
    }
    return values;
}

Any help? I have also tried using the StAX API to parse the large document, but that also …
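For comparison, a streaming StAX parser keeps only one record in memory at a time, so in principle it can handle the full 860MB file without splitting it. The sketch below is an assumption-laden illustration, not the asker's code: the class name `DblpStax` and the record layout are taken from the sample record above, and the real dblp.xml additionally declares character entities in dblp.dtd, which must be resolvable for parsing to succeed.

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class DblpStax {

    /** One parsed publication as [title, year, author1, author2, ...]. */
    static List<String[]> parse(InputStream in) throws Exception {
        XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(in);

        List<String[]> records = new ArrayList<>();
        List<String> authors = new ArrayList<>();
        String title = "", year = "";
        StringBuilder text = new StringBuilder();

        while (r.hasNext()) {
            int event = r.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                text.setLength(0);                 // start collecting a fresh text run
            } else if (event == XMLStreamConstants.CHARACTERS) {
                text.append(r.getText());
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                String name = r.getLocalName();
                if (name.equals("author")) authors.add(text.toString());
                else if (name.equals("title")) title = text.toString();
                else if (name.equals("year")) year = text.toString();
                else if (name.equals("inproceedings")) {   // record boundary: emit and reset
                    List<String> rec = new ArrayList<>();
                    rec.add(title);
                    rec.add(year);
                    rec.addAll(authors);
                    records.add(rec.toArray(new String[0]));
                    authors.clear();
                    title = year = "";
                }
            }
        }
        r.close();
        return records;
    }

    public static void main(String[] args) throws Exception {
        String sample = "<dblp><inproceedings key=\"conf/aime/BianchiD95\">"
                + "<author>Nadia Bianchi</author><author>Claudia Diamantini</author>"
                + "<title>Integration of Neural Networks...</title><year>1995</year>"
                + "</inproceedings></dblp>";
        for (String[] rec : parse(new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8)))) {
            System.out.println(rec[1] + " | " + rec[0]);
        }
    }
}
```

Because the reader never builds a tree, memory usage stays flat regardless of file size; to process the real file, replace the `ByteArrayInputStream` with a `FileInputStream` over dblp.xml.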


Comments (2)

三五鸿雁 2024-12-05 05:46:28


If your goal is just to get the details out, then just use a BufferedReader to read the file as a text file. If you want, throw in some regex.

If using MySQL is an option, you may be able to get it to do the heavy lifting through its XML functions.

Hope this helps.
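A minimal sketch of that line-oriented approach, assuming each field sits on its own line as in the sample record above (the class name `GrepDblp` and the regex are illustrative, not from the answer):

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GrepDblp {
    // Matches a simple single-line element like <author>Nadia Bianchi</author>;
    // the backreference \1 ensures the closing tag matches the opening one.
    static final Pattern FIELD = Pattern.compile("<(author|title|year)>(.*?)</\\1>");

    public static void main(String[] args) throws Exception {
        // Stand-in for new BufferedReader(new FileReader("dblp.xml"))
        String sample = "<inproceedings>\n"
                + "<author>Nadia Bianchi</author>\n"
                + "<year>1995</year>\n"
                + "</inproceedings>\n";
        try (BufferedReader br = new BufferedReader(new StringReader(sample))) {
            String line;
            while ((line = br.readLine()) != null) {
                Matcher m = FIELD.matcher(line);
                while (m.find()) {
                    System.out.println(m.group(1) + " = " + m.group(2));
                }
            }
        }
    }
}
```

This silently misses any element that spans multiple lines or contains nested markup, which is the trade-off of treating XML as plain text.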

影子的影子 2024-12-05 05:46:28


Don't fuss too much about the XML format. It is not terribly useful anyway. Just read it as a text file and parse the lines as strings. You can then export the data to CSV and use it however you want from that point.
Unfortunately, XML is not very efficient for large documents. I did something similar here for a research project:
http://qualityofdata.com/2011/03/27/dblp-for-sql-server/
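A sketch of that text-to-CSV route, under the same one-element-per-line assumption as the sample record (the class name `DblpToCsv`, the column layout, and the `stripTag` helper are all illustrative choices, not from the linked project):

```java
import java.io.BufferedReader;
import java.io.PrintWriter;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;

public class DblpToCsv {

    /** "<year>1995</year>" -> "1995"; naive, relies on one element per line. */
    static String stripTag(String line) {
        return line.replaceAll("<[^>]+>", "").trim();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for reading dblp.xml line by line from disk
        String sample = "<dblp>\n<inproceedings key=\"conf/aime/BianchiD95\">\n"
                + "<author>Nadia Bianchi</author>\n<title>Some Title.</title>\n<year>1995</year>\n"
                + "</inproceedings>\n</dblp>\n";
        StringWriter out = new StringWriter();
        PrintWriter csv = new PrintWriter(out);
        csv.println("title,year,authors");

        List<String> authors = new ArrayList<>();
        String title = "", year = "";
        try (BufferedReader br = new BufferedReader(new StringReader(sample))) {
            String line;
            while ((line = br.readLine()) != null) {
                String t = line.trim();
                if (t.startsWith("<author>")) authors.add(stripTag(t));
                else if (t.startsWith("<title>")) title = stripTag(t);
                else if (t.startsWith("<year>")) year = stripTag(t);
                else if (t.startsWith("</inproceedings>")) {
                    // End of one record: flush a CSV row and reset state
                    csv.printf("\"%s\",%s,\"%s\"%n", title, year, String.join("; ", authors));
                    authors.clear();
                    title = year = "";
                }
            }
        }
        csv.flush();
        System.out.print(out);
    }
}
```

From the CSV, loading into a database or spreadsheet is straightforward; note the quoting here is minimal and would need escaping for titles that themselves contain quotes.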
