How do I get text from a text box in an MS Word document using Apache POI?

Posted on 2024-10-26 16:37:21

I want to extract the text written inside text boxes in an MS Word document. I am using Apache POI to parse the document.

Currently I iterate over all the paragraph objects, but this paragraph list does not contain the text box content, so that information is missing from my output.

For example, given a document like this:

paragraph in plain text

**<some information in text box>**

one more paragraph in plain text

What I want to extract:

<para>paragraph in plain text</para>

<text_box>some information in text box</text_box>

<para>one more paragraph in plain text</para>

What I currently get:

paragraph in plain text

one more paragraph in plain text

Does anyone know how to extract text from a text box using Apache POI?


吃颗糖壮壮胆 2024-11-02 16:37:21

This worked for me:

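    import javax.xml.namespace.QName;
    import org.apache.poi.xwpf.usermodel.XWPFParagraph;
    import org.apache.xmlbeans.XmlException;
    import org.apache.xmlbeans.XmlObject;
    import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTP;

    private void printContentsOfTextBox(XWPFParagraph paragraph) {

        // A Java string literal cannot span lines, so the query is concatenated.
        // It matches both DrawingML (wps:txbx) and legacy VML (v:textbox) text boxes.
        String path =
            "declare namespace w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' "
            + "declare namespace wps='http://schemas.microsoft.com/office/word/2010/wordprocessingShape' "
            + "declare namespace v='urn:schemas-microsoft-com:vml' "
            + ".//*/wps:txbx/w:txbxContent | .//*/v:textbox/w:txbxContent";
        XmlObject[] textBoxObjects = paragraph.getCTP().selectPath(path);

        // Qualified name of the w:p (paragraph) element
        QName paragraphName =
            new QName("http://schemas.openxmlformats.org/wordprocessingml/2006/main", "p");

        for (XmlObject textBox : textBoxObjects) {
            try {
                for (XmlObject paraObject : textBox.selectChildren(paragraphName)) {
                    // Re-parse the raw XML so it can be wrapped in an XWPFParagraph
                    XWPFParagraph embeddedPara = new XWPFParagraph(
                        CTP.Factory.parse(paraObject.xmlText()), paragraph.getBody());
                    // Here you have your paragraph
                    System.out.println(embeddedPara.getText());
                }
            } catch (XmlException e) {
                // handle
            }
        }
    }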
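To make it clearer what that XPath query selects, here is a minimal sketch that runs an equivalent query against a small, hand-written WordprocessingML fragment using only the JDK's XML APIs (no POI). The class name `TextBoxXPathSketch`, the method `textBoxTexts`, and the `SAMPLE` markup are all made up for illustration, and the sample is heavily simplified: a real document.xml wraps the text box in more elements, and Word may store both a `wps:txbx` and a `v:textbox` fallback copy of the same box, in which case the text can match twice.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TextBoxXPathSketch {
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static final String WPS = "http://schemas.microsoft.com/office/word/2010/wordprocessingShape";
    static final String V = "urn:schemas-microsoft-com:vml";

    // Hypothetical, simplified body: two plain paragraphs around one text box.
    public static final String SAMPLE =
        "<w:body xmlns:w='" + W + "' xmlns:wps='" + WPS + "'>"
        + "<w:p><w:r><w:t>paragraph in plain text</w:t></w:r></w:p>"
        + "<w:p><w:r><wps:wsp><wps:txbx><w:txbxContent>"
        + "<w:p><w:r><w:t>some information in text box</w:t></w:r></w:p>"
        + "</w:txbxContent></wps:txbx></wps:wsp></w:r></w:p>"
        + "<w:p><w:r><w:t>one more paragraph in plain text</w:t></w:r></w:p>"
        + "</w:body>";

    // Returns the text of every paragraph nested inside a text box.
    public static List<String> textBoxTexts(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // required for prefixed XPath steps to match
        Document doc = factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));

        Map<String, String> prefixes = Map.of("w", W, "wps", WPS, "v", V);
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return prefixes.getOrDefault(prefix, XMLConstants.NULL_NS_URI);
            }
            public String getPrefix(String uri) { return null; }
            public Iterator<String> getPrefixes(String uri) { return null; }
        });

        // Same idea as the selectPath query: paragraphs under w:txbxContent,
        // for both DrawingML and VML text boxes.
        NodeList paragraphs = (NodeList) xpath.evaluate(
            "//wps:txbx/w:txbxContent/w:p | //v:textbox/w:txbxContent/w:p",
            doc, XPathConstants.NODESET);

        List<String> texts = new ArrayList<>();
        for (int i = 0; i < paragraphs.getLength(); i++) {
            texts.add(paragraphs.item(i).getTextContent());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(textBoxTexts(SAMPLE));
    }
}
```

In this sample the text box paragraph is nested inside a run of an ordinary body paragraph rather than sitting directly under the body, which is why iterating only the top-level paragraphs can miss it.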


耳根太软 2024-11-02 16:37:21

To extract all occurrences of text from Word .doc and .docx files for crgrep, I used the Apache Tika source as a reference for how the Apache POI APIs should be used. This is useful if you want to use POI directly without depending on Tika.

For Word .docx files, take a look at this Tika class:

org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator

If you ignore XHTMLContentHandler and the formatting code, you can see how to navigate an XWPFDocument correctly using POI.
For .doc files, this class is helpful:

org.apache.tika.parser.microsoft.WordExtractor

Both are in tika-parsers-1.x.jar. An easy way to browse the Tika code through your Maven dependencies is to add Tika temporarily to your pom.xml, for example:

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.7</version>
</dependency>

Then let your IDE resolve the attached sources and step into the classes above.


坏尐絯 2024-11-02 16:37:21

If you want to get the text from a text box in a .docx file (using POI 3.10-FINAL), here is some sample code:

FileInputStream fileInputStream = new FileInputStream(inputFile);
XWPFDocument document = new XWPFDocument(OPCPackage.open(fileInputStream)); 
for (XWPFParagraph xwpfParagraph : document.getParagraphs()) {
     String text = xwpfParagraph.getParagraphText(); //here is where you receive text from textbox
}

Alternatively, you can iterate over each XWPFRun in the XWPFParagraph and call its toString() method; the result is the same.
