How do I get text from a text box in an MS Word document using Apache POI?
I want to get the information written in a text box in an MS Word document. I am using Apache POI to parse the Word document.
Currently I am iterating through all the paragraph objects, but this paragraph list does not contain the text from the text box, so that text is missing from my output.
e.g.
paragraph in plain text
**<some information in text box>**
one more paragraph in plain text
What I want to extract:
<para>paragraph in plain text</para>
<text_box>some information in text box</text_box>
<para>one more paragraph in plain text</para>
What I am currently getting:
paragraph in plain text
one more paragraph in plain text
Does anyone know how to extract text from a text box using Apache POI?
Comments (3)
This worked for me,
To extract all occurrences of text from Word .doc and .docx files for crgrep, I used the Apache Tika source as a reference for how the Apache POI APIs should be used correctly. This is useful if you want to use POI directly and not depend on Tika.
For Word .docx files, take a look at this Tika class:
If you ignore XHTMLContentHandler and the formatting code, you can see how to navigate an XWPFDocument correctly using POI. For .doc files, this class is helpful:
Both are from tika-parsers-1.x.jar. An easy way to access the Tika code through your Maven dependencies is to add Tika temporarily to your pom.xml, then let your IDE resolve the attached sources and step into the classes above.
If you want to get text from a text box in a .docx file (using POI 3.10-FINAL), here is sample code:
Or you can iterate over each XWPFRun in an XWPFParagraph and invoke its toString() method. Same result.
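The sample code this answer refers to is not preserved here. As a rough illustration of where text box content lives inside a .docx file, the sketch below does not use POI at all: it is a swapped-in, JDK-only technique that opens the .docx as a zip archive, parses word/document.xml, and collects the text found under w:txbxContent elements, which is where WordprocessingML stores text box paragraphs. The class name DocxTextBoxes and its methods are hypothetical names chosen for this example, and skipping mc:Fallback copies is an assumption about how newer Word versions duplicate text box content.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical helper: returns the text of each paragraph stored inside a text box. */
final class DocxTextBoxes {
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    private static final String MC_NS =
            "http://schemas.openxmlformats.org/markup-compatibility/2006";

    /** Reads a .docx stream and returns one string per text box paragraph. */
    static List<String> extract(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if (e.getName().equals("word/document.xml")) {
                    DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
                    f.setNamespaceAware(true); // needed for the w: namespace lookups below
                    f.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
                    return collect(f.newDocumentBuilder().parse(zip));
                }
            }
        }
        return new ArrayList<>(); // no main document part found
    }

    private static List<String> collect(Document dom) {
        List<String> out = new ArrayList<>();
        NodeList boxes = dom.getElementsByTagNameNS(W_NS, "txbxContent");
        for (int i = 0; i < boxes.getLength(); i++) {
            Element box = (Element) boxes.item(i);
            // Assumption: newer files repeat each text box inside mc:Fallback,
            // so those copies are skipped to avoid duplicate output.
            if (insideFallback(box)) continue;
            NodeList paras = box.getElementsByTagNameNS(W_NS, "p");
            for (int j = 0; j < paras.getLength(); j++) {
                NodeList texts = ((Element) paras.item(j)).getElementsByTagNameNS(W_NS, "t");
                StringBuilder sb = new StringBuilder();
                for (int k = 0; k < texts.getLength(); k++) {
                    sb.append(texts.item(k).getTextContent()); // join the runs of one paragraph
                }
                out.add(sb.toString());
            }
        }
        return out;
    }

    private static boolean insideFallback(Node n) {
        for (Node p = n.getParentNode(); p != null; p = p.getParentNode()) {
            if ("Fallback".equals(p.getLocalName()) && MC_NS.equals(p.getNamespaceURI())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            for (String line : extract(in)) {
                System.out.println("<text_box>" + line + "</text_box>");
            }
        }
    }
}
```

Run it with the path to a .docx file as the only argument. It only covers the main document part (not headers, footers, or .doc files), so for anything beyond a quick check the POI approaches described in the answers above are the better fit.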