如何使用 Apache POI 提取 docx(Word 2007 以上)
你好,我正在使用 Apache POI 3.6 我已经创建了一些代码..
XWPFDocument doc = new XWPFDocument(new FileInputStream(file));
wordxExtractor = new XWPFWordExtractor(doc);
text = wordxExtractor.getText();
System.out.println("adding docx " + file);
d.add(new Field("content", text, Field.Store.NO, Field.Index.ANALYZED));
不幸的是,它生成了错误..
Exception in thread "main" java.lang.NoClassDefFoundError: org/dom4j/DocumentException
at org.apache.poi.openxml4j.opc.OPCPackage.init(OPCPackage.java:149)
at org.apache.poi.openxml4j.opc.OPCPackage.<init>(OPCPackage.java:136)
at org.apache.poi.openxml4j.opc.Package.<init>(Package.java:54)
at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:98)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:199)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:178)
at org.apache.poi.util.PackageHelper.open(PackageHelper.java:53)
at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:98)
at org.apache.lucene.demo.Indexer.indexDocs(Indexer.java:153)
at org.apache.lucene.demo.Indexer.main(Indexer.java:88)
它似乎使用了构造函数
XWPFWordExtractor(OPCPackage 容器),
但不是这个 ->
XWPFWordExtractor(XWPFDocument 文档)
想知道为什么吗? 或者知道如何提取 .docx 然后将其转换为字符串?
Hai, i'm using Apache POI 3.6
I've already created some code..
XWPFDocument doc = new XWPFDocument(new FileInputStream(file));
wordxExtractor = new XWPFWordExtractor(doc);
text = wordxExtractor.getText();
System.out.println("adding docx " + file);
d.add(new Field("content", text, Field.Store.NO, Field.Index.ANALYZED));
unfortunately, it generated error..
Exception in thread "main" java.lang.NoClassDefFoundError: org/dom4j/DocumentException
at org.apache.poi.openxml4j.opc.OPCPackage.init(OPCPackage.java:149)
at org.apache.poi.openxml4j.opc.OPCPackage.<init>(OPCPackage.java:136)
at org.apache.poi.openxml4j.opc.Package.<init>(Package.java:54)
at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:98)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:199)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:178)
at org.apache.poi.util.PackageHelper.open(PackageHelper.java:53)
at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:98)
at org.apache.lucene.demo.Indexer.indexDocs(Indexer.java:153)
at org.apache.lucene.demo.Indexer.main(Indexer.java:88)
It seemed that it used Constructor
XWPFWordExtractor(OPCPackage container)
but not this one ->
XWPFWordExtractor(XWPFDocument document)
Any wondering why??
Or any idea how I can extract the .docx then convert it into a String?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
您需要将 dom4j 库添加到您的 claspath 或项目库中
You need to Add dom4j Library to your claspath or your project libraries
看起来您的类路径上没有所有依赖项。
如果您查看 http://poi.apache.org/overview.html 您会看到dom4j 是处理 OOXML 文件时必需的库。从你得到的异常来看,你似乎没有它......如果你查看 POI 二进制下载,你应该在 ooxml-libs 子目录中找到它。
It looks like you don't have all of the dependencies on your classpath.
If you look at http://poi.apache.org/overview.html you'll see that dom4j is a required library when working with the OOXML files. From the exception you got, it seems that you don't have it... If you look in the POI binary download, you should find it in the ooxml-libs subdirectory.
你可以尝试使用 docx4j 来代替;请参阅 http://dev .plutext.org/svn/docx4j/trunk/docx4j/src/main/java/org/docx4j/TextUtils.java
You could try docx4j instead; see http://dev.plutext.org/svn/docx4j/trunk/docx4j/src/main/java/org/docx4j/TextUtils.java