Parsing document structure with Java

Posted 2024-10-16 18:17:08

We need to get a tree-like structure from a given text document using Java. The file type should be common and open (rtf, odt, ...). Currently we use Apache Tika to parse plain text from multiple documents.

What file type and API should we use to get the correct structure parsed most reliably? If this is possible with Tika, I would be happy to see a demonstration.

For example, we should get this kind of data from the given document:

Main Heading
  Heading 1
    Heading 1.1
  Heading 2
    Heading 2.2

Main Heading is the title of the paper. The paper has two main headings, Heading 1 and Heading 2, and each has one subheading. We should also get the contents under each heading (paragraph text).

Any help is appreciated.

世界等同你 2024-10-23 18:17:08

OpenDocument (.odt) is essentially a zip package containing multiple XML files. content.xml contains the actual textual content of the document. We are interested in headings, which can be found inside text:h tags. Read more about ODT.
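
To make the container layout concrete, here is a minimal sketch that builds a tiny zip in memory and lists its entries. The class name OdtLayoutSketch and both helper methods are hypothetical, and the archive holds placeholder bytes rather than a valid OpenDocument file; only the entry names (mimetype, content.xml, styles.xml, meta.xml) mirror what a real .odt typically contains.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class OdtLayoutSketch {

    // Builds a tiny zip in memory that mimics the layout of an .odt file.
    // The entry contents are placeholders, not a valid OpenDocument file.
    static byte[] buildFakeOdt() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : new String[] {"mimetype", "content.xml", "styles.xml", "meta.xml"}) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write(("placeholder for " + name).getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    // Lists entry names in the order they are stored in the archive.
    static List<String> listEntries(byte[] zipBytes) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                names.add(e.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // prints [mimetype, content.xml, styles.xml, meta.xml]
        System.out.println(listEntries(buildFakeOdt()));
    }
}
```

Running the same listEntries logic over the bytes of a real .odt file should show content.xml among the entries, which is the one the heading extraction reads.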

I found an implementation for extracting headings from .odt files with QueryPath.

Since the original question was about Java, here is a Java version. First we get access to content.xml by using ZipFile. Then we use SAX to parse the XML in content.xml. The sample code simply prints out all the headings:

Test3.odt
content.xml
3764
1 My New Great Paper
2 Abstract
2 Introduction
2 Content
3 More content
3 Even more
2 Conclusions
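
The question asks for a tree rather than a flat list. As a rough sketch, assuming each heading arrives as a "level title" line in document order like the output above, the lines can be folded into a tree with a stack. The class HeadingTreeSketch, its Node type, and the methods buildTree and render are hypothetical names introduced only for this illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class HeadingTreeSketch {

    // Hypothetical node type: one heading plus its sub-headings.
    static final class Node {
        final int level;
        final String title;
        final List<Node> children = new ArrayList<>();

        Node(int level, String title) {
            this.level = level;
            this.title = title;
        }
    }

    // Folds "level title" lines (in document order) into a tree under a synthetic root.
    static Node buildTree(List<String> lines) {
        Node root = new Node(0, "(root)");
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        for (String line : lines) {
            int space = line.indexOf(' ');
            int level = Integer.parseInt(line.substring(0, space));
            Node node = new Node(level, line.substring(space + 1));
            // Pop until the top of the stack is a shallower heading: that is the parent.
            while (stack.peek().level >= level) {
                stack.pop();
            }
            stack.peek().children.add(node);
            stack.push(node);
        }
        return root;
    }

    // Renders the tree with two spaces of indentation per depth.
    static String render(Node node, int depth) {
        StringBuilder out = new StringBuilder();
        for (Node child : node.children) {
            out.append("  ".repeat(depth)).append(child.title).append('\n');
            out.append(render(child, depth + 1));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "1 My New Great Paper", "2 Abstract", "2 Introduction", "2 Content",
                "3 More content", "3 Even more", "2 Conclusions");
        System.out.print(render(buildTree(lines), 0));
    }
}
```

The synthetic root has level 0, so it is never popped as long as outline levels start at 1. A heading that skips a level (for example a level 3 directly under a level 1) simply becomes a child of the nearest shallower heading.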


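Sample code:

    import java.io.File;
    import java.io.InputStream;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipFile;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.InputSource;
    import org.xml.sax.XMLReader;

    public class OdtDocumentStructureExtractor {

        public void printHeadingsOfOdtFile(File odtFile) {
            // try-with-resources closes the archive even if parsing fails.
            try (ZipFile zFile = new ZipFile(odtFile)) {
                System.out.println(zFile.getName());

                ZipEntry contentFile = zFile.getEntry("content.xml");
                if (contentFile == null) {
                    System.err.println("content.xml not found in " + odtFile);
                    return;
                }
                System.out.println(contentFile.getName());
                System.out.println(contentFile.getSize());

                // XMLReaderFactory is deprecated since Java 9, so the reader comes from
                // SAXParserFactory. Its default parser is not namespace-aware, which means
                // the handler sees prefixed names such as "text:h" in qName.
                XMLReader xr = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
                xr.setContentHandler(new OdtDocumentContentHandler());

                try (InputStream in = zFile.getInputStream(contentFile)) {
                    xr.parse(new InputSource(in));
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        public static void main(String[] args) {
            new OdtDocumentStructureExtractor().printHeadingsOfOdtFile(new File("Test3.odt"));
        }
    }
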
The relevant parts of the ContentHandler look like this:

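    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    // The original post showed only the three callbacks; the class wrapper and
    // the DefaultHandler base class are assumed here.
    public class OdtDocumentContentHandler extends DefaultHandler {

        // All character data seen so far.
        private final StringBuilder fullText = new StringBuilder();
        // Text of the heading currently being read.
        private final StringBuilder heading = new StringBuilder();
        private boolean inHeading = false;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("text:h".equals(qName)) {
                inHeading = true;
                heading.setLength(0);
                String headingLevel = atts.getValue("text:outline-level");
                if (headingLevel != null) {
                    System.out.print(headingLevel + " ");
                }
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // SAX may deliver one text run in several chunks, and a heading may
            // contain nested elements, so append instead of overwriting.
            fullText.append(ch, start, length);
            if (inHeading) {
                heading.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("text:h".equals(qName)) {
                System.out.println(heading);
                inHeading = false;
            }
        }
    }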
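
The question also asks for the paragraph text under each heading, which the handler above does not collect. A possible extension, sketched under the assumption that body paragraphs appear as text:p elements following their text:h heading, is shown below. The class HeadingContentSketch, its Section type, and the parse method are hypothetical, and the inline XML is a hand-written fragment for illustration, not a complete content.xml.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class HeadingContentSketch {

    // Hypothetical record of one heading and the paragraphs that follow it.
    static final class Section {
        final String level;
        final StringBuilder title = new StringBuilder();
        final List<String> paragraphs = new ArrayList<>();

        Section(String level) {
            this.level = level;
        }
    }

    static final class Handler extends DefaultHandler {
        final List<Section> sections = new ArrayList<>();
        private StringBuilder paragraph; // non-null while inside text:p
        private boolean inHeading;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("text:h".equals(qName)) {
                sections.add(new Section(atts.getValue("text:outline-level")));
                inHeading = true;
            } else if ("text:p".equals(qName)) {
                paragraph = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inHeading) {
                sections.get(sections.size() - 1).title.append(ch, start, length);
            } else if (paragraph != null) {
                paragraph.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("text:h".equals(qName)) {
                inHeading = false;
            } else if ("text:p".equals(qName) && paragraph != null) {
                // Paragraphs before the first heading are dropped in this sketch.
                if (!sections.isEmpty()) {
                    sections.get(sections.size() - 1).paragraphs.add(paragraph.toString());
                }
                paragraph = null;
            }
        }
    }

    // Parses an XML string with the default (non-namespace-aware) SAX parser.
    static List<Section> parse(String xml) throws Exception {
        Handler handler = new Handler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.sections;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc xmlns:text=\"urn:oasis:names:tc:opendocument:xmlns:text:1.0\">"
                + "<text:h text:outline-level=\"1\">Heading 1</text:h>"
                + "<text:p>First paragraph.</text:p>"
                + "<text:h text:outline-level=\"2\">Heading 1.1</text:h>"
                + "<text:p>Nested <text:span>styled</text:span> text.</text:p>"
                + "</doc>";
        for (Section s : parse(xml)) {
            System.out.println(s.level + " " + s.title + " -> " + s.paragraphs);
        }
    }
}
```

Real documents also place text:p elements inside lists, tables, and frames, so a production version would likely need to decide how to treat those; this sketch attaches every paragraph to the most recent heading.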
