Apache POI: parsing content controls in a docx file
I'm trying to parse a docx file that contains content control fields (added using a dialog like this one; it is a reference image, mine is in another language).
I'm using the Apache POI library. I found this question on how to do it and used the same code:
import java.io.FileInputStream;
import org.apache.poi.xwpf.usermodel.*;
import java.util.List;
import java.util.ArrayList;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.*;
import org.apache.xmlbeans.XmlCursor;
import javax.xml.namespace.QName;

public class ReadWordForm {

    private static List<XWPFSDT> extractSDTsFromBody(XWPFDocument document) {
        XWPFSDT sdt;
        XmlCursor xmlcursor = document.getDocument().getBody().newCursor();
        QName qnameSdt = new QName("http://schemas.openxmlformats.org/wordprocessingml/2006/main", "sdt", "w");
        List<XWPFSDT> allsdts = new ArrayList<XWPFSDT>();
        while (xmlcursor.hasNextToken()) {
            XmlCursor.TokenType tokentype = xmlcursor.toNextToken();
            if (tokentype.isStart()) {
                if (qnameSdt.equals(xmlcursor.getName())) {
                    if (xmlcursor.getObject() instanceof CTSdtRun) {
                        sdt = new XWPFSDT((CTSdtRun)xmlcursor.getObject(), document);
                        //System.out.println("block: " + sdt);
                        allsdts.add(sdt);
                    } else if (xmlcursor.getObject() instanceof CTSdtBlock) {
                        sdt = new XWPFSDT((CTSdtBlock)xmlcursor.getObject(), document);
                        //System.out.println("inline: " + sdt);
                        allsdts.add(sdt);
                    }
                }
            }
        }
        return allsdts;
    }

    public static void main(String[] args) throws Exception {
        XWPFDocument document = new XWPFDocument(new FileInputStream("WordDataCollectingForm.docx"));
        List<XWPFSDT> allsdts = extractSDTsFromBody(document);
        for (XWPFSDT sdt : allsdts) {
            //System.out.println(sdt);
            String title = sdt.getTitle();
            String content = sdt.getContent().getText();
            if (!(title == null) && !(title.isEmpty())) {
                System.out.println(title + ": " + content);
            } else {
                System.out.println("====sdt without title====");
            }
        }
        document.close();
    }
}
The problem is that this code doesn't see these fields in my docx file until I open the file in LibreOffice and re-save it. So if a file that comes from Windows is passed to this code, the code doesn't see these content control fields. But if I re-save the file in LibreOffice (using the same format), the code starts to see these fields, even though some data is lost (titles and tags of some fields). Can someone tell me what the reason might be and how I can fix it so that the code sees these fields? Or is there maybe an easier way using docx4j? Unfortunately there's not much info on the internet about how to do this with these two libraries; at least I didn't find any.
Example files are located on Google Drive. The first one doesn't work; the second one works (after it was opened in LibreOffice and a field was changed to one of the options).
Comments (1)
According to your uploaded sample files, your content controls are in a table. The code you found only gets content controls directly from the document body.
Tables are beastly things in Word, because each table cell may contain a whole document body. That's why content controls in table cells are strictly separated from content controls in the main document body. Their ooxml class is CTSdtCell instead of CTSdtRun or CTSdtBlock, and in apache poi their class is XWPFSDTCell instead of XWPFSDT.
If it is only about reading the content, then one can fall back to XWPFAbstractSDT, which is the abstract parent class of XWPFSDTCell as well as of XWPFSDT. So the code from the question should work if it collects XWPFAbstractSDT and gets an additional branch for CTSdtCell. But with a code line like sdt = new XWPFSDTCell((CTSdtCell)xmlcursor.getObject(), null, null), the XWPFSDTCell totally loses its connection to its table and table row.
There is no proper method to get the XWPFSDTCell directly from an XWPFTable. So if one needs the XWPFSDTCell connected to its table, then parsing the XML is needed as well, done per table row and called for the rows of each table in the document.
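The two approaches described above might be sketched as follows. This is only an illustration under assumptions, not necessarily the code originally posted: the class name ReadWordFormWithTables and the method extractSDTsFromTableRow are made up here, the file path is taken from the command line, and the row-level method only looks at direct children of the row's XML element rather than walking every token.

```java
import java.io.FileInputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.namespace.QName;
import org.apache.poi.xwpf.usermodel.XWPFAbstractSDT;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFSDT;
import org.apache.poi.xwpf.usermodel.XWPFSDTCell;
import org.apache.poi.xwpf.usermodel.XWPFTable;
import org.apache.poi.xwpf.usermodel.XWPFTableRow;
import org.apache.xmlbeans.XmlCursor;
import org.apache.xmlbeans.XmlObject;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTSdtBlock;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTSdtCell;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTSdtRun;

// Hypothetical class name, chosen for this sketch only.
class ReadWordFormWithTables {

    private static final QName SDT = new QName(
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main", "sdt", "w");

    // Approach 1: read-only fallback. Walks all tokens below the body and wraps
    // every w:sdt it finds, including cell-level ones, as XWPFAbstractSDT.
    // Cursor release is omitted, as in the question's code.
    static List<XWPFAbstractSDT> extractSDTsFromBody(XWPFDocument document) {
        List<XWPFAbstractSDT> allSdts = new ArrayList<>();
        XmlCursor cursor = document.getDocument().getBody().newCursor();
        while (cursor.hasNextToken()) {
            XmlCursor.TokenType tokenType = cursor.toNextToken();
            if (!tokenType.isStart() || !SDT.equals(cursor.getName())) {
                continue;
            }
            XmlObject object = cursor.getObject();
            if (object instanceof CTSdtRun) {
                allSdts.add(new XWPFSDT((CTSdtRun) object, document));
            } else if (object instanceof CTSdtBlock) {
                allSdts.add(new XWPFSDT((CTSdtBlock) object, document));
            } else if (object instanceof CTSdtCell) {
                // No row and no body are known here, so the table connection is lost.
                allSdts.add(new XWPFSDTCell((CTSdtCell) object, null, null));
            }
        }
        return allSdts;
    }

    // Approach 2: per-row parsing that keeps the connection to row and table.
    // Only direct children of the row element are inspected.
    static List<XWPFSDTCell> extractSDTsFromTableRow(XWPFTableRow row) {
        List<XWPFSDTCell> rowSdts = new ArrayList<>();
        XmlCursor cursor = row.getCtRow().newCursor();
        if (cursor.toFirstChild()) {
            do {
                XmlObject object = cursor.getObject();
                if (object instanceof CTSdtCell) {
                    rowSdts.add(new XWPFSDTCell((CTSdtCell) object, row, row.getTable().getBody()));
                }
            } while (cursor.toNextSibling());
        }
        return rowSdts;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: ReadWordFormWithTables <file.docx>");
            return;
        }
        try (XWPFDocument document = new XWPFDocument(new FileInputStream(args[0]))) {
            for (XWPFAbstractSDT sdt : extractSDTsFromBody(document)) {
                System.out.println(sdt.getTitle() + ": " + sdt.getContent().getText());
            }
            for (XWPFTable table : document.getTables()) {
                for (XWPFTableRow row : table.getRows()) {
                    for (XWPFSDTCell sdt : extractSDTsFromTableRow(row)) {
                        System.out.println("in table: " + sdt.getTitle() + ": " + sdt.getContent().getText());
                    }
                }
            }
        }
    }
}
```

Note that document.getTables() returns only the tables placed directly in the document body, so content controls in nested tables would be found by the first method but not by the per-row loop in main.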
Using the current apache poi 5.2.0, it is possible to get the XWPFSDTCell from an XWPFTableRow via XWPFTableRow.getTableICells. This returns a List of ICells; ICell is an interface which XWPFSDTCell also implements. So all XWPFSDTCell can be collected from tables without low-level XML parsing, by iterating over the rows and keeping those ICells that are instances of XWPFSDTCell.
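A minimal sketch of the getTableICells approach could look like this. The class name ReadSdtCellsFromTables and the helper extractSDTCells are hypothetical names for this illustration, and the file path is again assumed to come from the command line.

```java
import java.io.FileInputStream;
import java.util.ArrayList;
import java.util.List;
import org.apache.poi.xwpf.usermodel.ICell;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFSDTCell;
import org.apache.poi.xwpf.usermodel.XWPFTable;
import org.apache.poi.xwpf.usermodel.XWPFTableRow;

// Hypothetical class name, chosen for this sketch only.
class ReadSdtCellsFromTables {

    // Collects the cell-level content controls of all tables that sit directly
    // in the document body, using only the high-level usermodel API.
    static List<XWPFSDTCell> extractSDTCells(XWPFDocument document) {
        List<XWPFSDTCell> sdtCells = new ArrayList<>();
        for (XWPFTable table : document.getTables()) {
            for (XWPFTableRow row : table.getRows()) {
                // getTableICells returns ordinary cells and SDT cells in row order.
                for (ICell cell : row.getTableICells()) {
                    if (cell instanceof XWPFSDTCell) {
                        sdtCells.add((XWPFSDTCell) cell);
                    }
                }
            }
        }
        return sdtCells;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: ReadSdtCellsFromTables <file.docx>");
            return;
        }
        try (XWPFDocument document = new XWPFDocument(new FileInputStream(args[0]))) {
            for (XWPFSDTCell sdt : extractSDTCells(document)) {
                String title = sdt.getTitle();
                String content = sdt.getContent().getText();
                if (title != null && !title.isEmpty()) {
                    System.out.println(title + ": " + content);
                } else {
                    System.out.println("====sdt without title====");
                }
            }
        }
    }
}
```

Content controls outside of tables are not covered by this loop, so for a mixed document it would probably have to be combined with the body-level extraction from the question.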