Apache POI docx 文件内容控制解析

发布于 2025-01-10 15:00:15 字数 2652 浏览 5 评论 0原文

我正在尝试解析包含内容控制字段的 docx 文件（使用像这样的窗口添加的，参考图像，我的是另一种语言）

我正在使用库 APACHE POI。我发现这个问题关于如何做到这一点。我使用了相同的代码：

import java.io.FileInputStream;

import org.apache.poi.xwpf.usermodel.*;

import java.util.List;
import java.util.ArrayList;

import org.openxmlformats.schemas.wordprocessingml.x2006.main.*;
import org.apache.xmlbeans.XmlCursor;
import javax.xml.namespace.QName;

public class ReadWordForm {

 private static List<XWPFSDT> extractSDTsFromBody(XWPFDocument document) {
  XWPFSDT sdt;
  XmlCursor xmlcursor = document.getDocument().getBody().newCursor();
  QName qnameSdt = new QName("http://schemas.openxmlformats.org/wordprocessingml/2006/main", "sdt", "w");
  List<XWPFSDT> allsdts = new ArrayList<XWPFSDT>();
  while (xmlcursor.hasNextToken()) {
   XmlCursor.TokenType tokentype = xmlcursor.toNextToken();
   if (tokentype.isStart()) {
    if (qnameSdt.equals(xmlcursor.getName())) {
     if (xmlcursor.getObject() instanceof CTSdtRun) {
      sdt = new XWPFSDT((CTSdtRun)xmlcursor.getObject(), document); 
//System.out.println("block: " + sdt);
      allsdts.add(sdt);
     } else if (xmlcursor.getObject() instanceof CTSdtBlock) {
      sdt = new XWPFSDT((CTSdtBlock)xmlcursor.getObject(), document); 
//System.out.println("inline: " + sdt);
      allsdts.add(sdt);
     }
    } 
   }
  }
  return allsdts;
 }

 public static void main(String[] args) throws Exception {

  XWPFDocument document = new XWPFDocument(new FileInputStream("WordDataCollectingForm.docx"));

  List<XWPFSDT> allsdts = extractSDTsFromBody(document);

  for (XWPFSDT sdt : allsdts) {
//System.out.println(sdt);
   String title = sdt.getTitle();
   String content = sdt.getContent().getText();
   if (!(title == null) && !(title.isEmpty())) {
    System.out.println(title + ": " + content);
   } else {
    System.out.println("====sdt without title====");
   }
  }

  document.close();
 }
}

问题是该代码在我的 docx 文件中看不到这些字段，直到我在 LibreOffice 中打开它并重新保存它。因此，如果文件来自 Windows，并被放入此代码中，则它不会看到这些内容控制字段。但是，如果我将文件重新保存在 LibreOffice 中（使用相同的格式），它就会开始看到这些字段，即使它丢失了一些数据（某些字段的标题和标签）。有人可以告诉我这可能是什么原因，我该如何修复它才能看到这些字段？或者也许有更简单的方法使用 docx4j ？不幸的是，互联网上没有太多关于如何使用这两个库来做到这一点的信息，至少我没有找到它。

示例文件位于 Google 磁盘上。第一个文件不存在工作，第二个工作（在 Libre 中打开后，字段更改为选项之一）。

原文

I'm trying to parse docx file that contains content control fields (that are added using window like this, reference image, mine is on another language)

I'm using library APACHE POI. I found this question on how to do it. I used the same code:

import java.io.FileInputStream;

import org.apache.poi.xwpf.usermodel.*;

import java.util.List;
import java.util.ArrayList;

import org.openxmlformats.schemas.wordprocessingml.x2006.main.*;
import org.apache.xmlbeans.XmlCursor;
import javax.xml.namespace.QName;

public class ReadWordForm {

 private static List<XWPFSDT> extractSDTsFromBody(XWPFDocument document) {
  XWPFSDT sdt;
  XmlCursor xmlcursor = document.getDocument().getBody().newCursor();
  QName qnameSdt = new QName("http://schemas.openxmlformats.org/wordprocessingml/2006/main", "sdt", "w");
  List<XWPFSDT> allsdts = new ArrayList<XWPFSDT>();
  while (xmlcursor.hasNextToken()) {
   XmlCursor.TokenType tokentype = xmlcursor.toNextToken();
   if (tokentype.isStart()) {
    if (qnameSdt.equals(xmlcursor.getName())) {
     if (xmlcursor.getObject() instanceof CTSdtRun) {
      sdt = new XWPFSDT((CTSdtRun)xmlcursor.getObject(), document); 
//System.out.println("block: " + sdt);
      allsdts.add(sdt);
     } else if (xmlcursor.getObject() instanceof CTSdtBlock) {
      sdt = new XWPFSDT((CTSdtBlock)xmlcursor.getObject(), document); 
//System.out.println("inline: " + sdt);
      allsdts.add(sdt);
     }
    } 
   }
  }
  return allsdts;
 }

 public static void main(String[] args) throws Exception {

  XWPFDocument document = new XWPFDocument(new FileInputStream("WordDataCollectingForm.docx"));

  List<XWPFSDT> allsdts = extractSDTsFromBody(document);

  for (XWPFSDT sdt : allsdts) {
//System.out.println(sdt);
   String title = sdt.getTitle();
   String content = sdt.getContent().getText();
   if (!(title == null) && !(title.isEmpty())) {
    System.out.println(title + ": " + content);
   } else {
    System.out.println("====sdt without title====");
   }
  }

  document.close();
 }
}

The problem is that this code doesn't see these fields in the my docx file until I open it in LibreOffice and re-save it. So if the file is from Windows being put into this code it doesn't see these content control fields. But if I re-save the file in the LibreOffice (using the same format) it starts to see these fields, even tho it loses some of the data (titles and tags of some fields). Can someone tell me what might be the reason of it, how do I fix that so it will see these fields? Or there's an easier way using docx4j maybe? Unfortunately there's not much info about how to do it using these 2 libs in the internet, at least I didn't find it.

Examle files are located on google disk. The first one doesn't work, the second one works (after it was opened in Libre and field was changed to one of the options).

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

生来就爱笑 2025-01-17 15:00:15

根据您上传的示例文件，您的内容控件位于表格中。您找到的代码仅直接从文档正文获取内容控件。

表格在 Word 中是个可怕的东西，因为每个表格单元格都可能包含整个文档正文。这就是表格单元格中的内容控件与主文档正文中的内容控件严格分开的原因。他们的 ooxml 类是 CTSdtCell 而不是 CTSdtRun 或 CTSdtBlock 并且在 apache poi 中类是 XWPFSDTCell 而不是 XWPFSDT。

如果只是读取内容，那么可以回退到 XWPFAbstractSDT，它是 XWPFSDTCell 以及 XWPFSDT 的抽象父类。因此，以下代码应该有效：

 private static List<XWPFAbstractSDT> extractSDTsFromBody(XWPFDocument document) {
  XWPFAbstractSDT sdt;
  XmlCursor xmlcursor = document.getDocument().getBody().newCursor();
  QName qnameSdt = new QName("http://schemas.openxmlformats.org/wordprocessingml/2006/main", "sdt", "w");
  List<XWPFAbstractSDT> allsdts = new ArrayList<XWPFAbstractSDT>();
  while (xmlcursor.hasNextToken()) {
   XmlCursor.TokenType tokentype = xmlcursor.toNextToken();
   if (tokentype.isStart()) {
    if (qnameSdt.equals(xmlcursor.getName())) {
//System.out.println(xmlcursor.getObject().getClass().getName());
     if (xmlcursor.getObject() instanceof CTSdtRun) {
      sdt = new XWPFSDT((CTSdtRun)xmlcursor.getObject(), document); 
//System.out.println("block: " + sdt);
      allsdts.add(sdt);
     } else if (xmlcursor.getObject() instanceof CTSdtBlock) {
      sdt = new XWPFSDT((CTSdtBlock)xmlcursor.getObject(), document); 
//System.out.println("inline: " + sdt);
      allsdts.add(sdt);
     } else if (xmlcursor.getObject() instanceof CTSdtCell) {
      sdt = new XWPFSDTCell((CTSdtCell)xmlcursor.getObject(), null, null); 
//System.out.println("cell: " + sdt);
      allsdts.add(sdt);
     }
    } 
   }
  }
  return allsdts;
 }

但正如您在代码行 sdt = new XWPFSDTCell((CTSdtCell)xmlcursor.getObject(), null, null) 中看到的，XWPFSDTCell 完全丢失它与表和表行的连接。

没有直接从 XWPFTable 获取 XWPFSDTCell 的正确方法。因此，如果需要将 XWPFSDTCell 连接到其表，则还需要解析 XML。这可能看起来像这样：

 private static List<XWPFSDTCell> extractSDTsFromTableRow(XWPFTableRow row) {
  XWPFSDTCell sdt;
  XmlCursor xmlcursor = row.getCtRow().newCursor();
  QName qnameSdt = new QName("http://schemas.openxmlformats.org/wordprocessingml/2006/main", "sdt", "w");
  QName qnameTr = new QName("http://schemas.openxmlformats.org/wordprocessingml/2006/main", "tr", "w");
  List<XWPFSDTCell> allsdts = new ArrayList<XWPFSDTCell>();
  while (xmlcursor.hasNextToken()) {
   XmlCursor.TokenType tokentype = xmlcursor.toNextToken();
   if (tokentype.isStart()) {
    if (qnameSdt.equals(xmlcursor.getName())) {
//System.out.println(xmlcursor.getObject().getClass().getName());
     if (xmlcursor.getObject() instanceof CTSdtCell) {
      sdt = new XWPFSDTCell((CTSdtCell)xmlcursor.getObject(), row, row.getTable().getBody()); 
//System.out.println("cell: " + sdt);
      allsdts.add(sdt);
     }
    } 
   } else if (tokentype.isEnd()) {
    //we have to check whether we are at the end of the table row
    xmlcursor.push();
    xmlcursor.toParent();  
    if (qnameTr.equals(xmlcursor.getName())) {
     break;
    }
    xmlcursor.pop();
   }
  }
  return allsdts;
 }

并从文档中调用，如下所示：

...
  for (XWPFTable table : document.getTables()) {
   for (XWPFTableRow row : table.getRows()) {
    List<XWPFSDTCell> allTrsdts = extractSDTsFromTableRow(row);
    for (XWPFSDTCell sdt : allTrsdts) {
//System.out.println(sdt);
     String title = sdt.getTitle();
     String content = sdt.getContent().getText();
     if (!(title == null) && !(title.isEmpty())) {
      System.out.println(title + ": " + content);
     } else {
      System.out.println("====sdt without title====");
      System.out.println(content);
     }
    }
   }
  }
...

使用 current apache poi 5.2.0 可以通过以下方式从 XWPFTableRow 获取 XWPFSDTCell XWPFTableRow.getTableICells。这将获取 ICell 是 XWPFSDTCell 也实现的接口。

因此，以下代码将从表中获取所有 XWPFSDTCell ，而不需要低级 XML 解析：

...
  for (XWPFTable table : document.getTables()) {
   for (XWPFTableRow row : table.getRows()) {
    for (ICell iCell : row.getTableICells()) {
     if (iCell instanceof XWPFSDTCell) {
      XWPFSDTCell sdt = (XWPFSDTCell)iCell;
//System.out.println(sdt);
      String title = sdt.getTitle();
      String content = sdt.getContent().getText();
      if (!(title == null) && !(title.isEmpty())) {
       System.out.println(title + ": " + content);
      } else {
       System.out.println("====sdt without title====");
       System.out.println(content);
      }
     }
    }
   }
  }
...

According to your uploaded sample files your content controls are in a table. The code you had found only gets content controls from document body directly.

Tables are beastly things in Word as table cells may contain whole document bodies each. That's why content controls in table cells are strictly separated from content controls in main document body. Their ooxml class is CTSdtCell instead of CTSdtRun or CTSdtBlock and in apache poi their class is XWPFSDTCell instead of XWPFSDT.

If it is only about reading the content, then one could fall back to XWPFAbstractSDT which is the abstract parent class of XWPFSDTCell as well as of XWPFSDT. So following code should work:

 private static List<XWPFAbstractSDT> extractSDTsFromBody(XWPFDocument document) {
  XWPFAbstractSDT sdt;
  XmlCursor xmlcursor = document.getDocument().getBody().newCursor();
  QName qnameSdt = new QName("http://schemas.openxmlformats.org/wordprocessingml/2006/main", "sdt", "w");
  List<XWPFAbstractSDT> allsdts = new ArrayList<XWPFAbstractSDT>();
  while (xmlcursor.hasNextToken()) {
   XmlCursor.TokenType tokentype = xmlcursor.toNextToken();
   if (tokentype.isStart()) {
    if (qnameSdt.equals(xmlcursor.getName())) {
//System.out.println(xmlcursor.getObject().getClass().getName());
     if (xmlcursor.getObject() instanceof CTSdtRun) {
      sdt = new XWPFSDT((CTSdtRun)xmlcursor.getObject(), document); 
//System.out.println("block: " + sdt);
      allsdts.add(sdt);
     } else if (xmlcursor.getObject() instanceof CTSdtBlock) {
      sdt = new XWPFSDT((CTSdtBlock)xmlcursor.getObject(), document); 
//System.out.println("inline: " + sdt);
      allsdts.add(sdt);
     } else if (xmlcursor.getObject() instanceof CTSdtCell) {
      sdt = new XWPFSDTCell((CTSdtCell)xmlcursor.getObject(), null, null); 
//System.out.println("cell: " + sdt);
      allsdts.add(sdt);
     }
    } 
   }
  }
  return allsdts;
 }

But as you see in code line sdt = new XWPFSDTCell((CTSdtCell)xmlcursor.getObject(), null, null), the XWPFSDTCell totaly lost its connection to table and tablerow.

There is not a proper method to get the XWPFSDTCell directly from a XWPFTable. So If one would need to get XWPFSDTCell connected to its table, then also parsing the XML is needed. This could look like so:

 private static List<XWPFSDTCell> extractSDTsFromTableRow(XWPFTableRow row) {
  XWPFSDTCell sdt;
  XmlCursor xmlcursor = row.getCtRow().newCursor();
  QName qnameSdt = new QName("http://schemas.openxmlformats.org/wordprocessingml/2006/main", "sdt", "w");
  QName qnameTr = new QName("http://schemas.openxmlformats.org/wordprocessingml/2006/main", "tr", "w");
  List<XWPFSDTCell> allsdts = new ArrayList<XWPFSDTCell>();
  while (xmlcursor.hasNextToken()) {
   XmlCursor.TokenType tokentype = xmlcursor.toNextToken();
   if (tokentype.isStart()) {
    if (qnameSdt.equals(xmlcursor.getName())) {
//System.out.println(xmlcursor.getObject().getClass().getName());
     if (xmlcursor.getObject() instanceof CTSdtCell) {
      sdt = new XWPFSDTCell((CTSdtCell)xmlcursor.getObject(), row, row.getTable().getBody()); 
//System.out.println("cell: " + sdt);
      allsdts.add(sdt);
     }
    } 
   } else if (tokentype.isEnd()) {
    //we have to check whether we are at the end of the table row
    xmlcursor.push();
    xmlcursor.toParent();  
    if (qnameTr.equals(xmlcursor.getName())) {
     break;
    }
    xmlcursor.pop();
   }
  }
  return allsdts;
 }

And called from document like so:

...
  for (XWPFTable table : document.getTables()) {
   for (XWPFTableRow row : table.getRows()) {
    List<XWPFSDTCell> allTrsdts = extractSDTsFromTableRow(row);
    for (XWPFSDTCell sdt : allTrsdts) {
//System.out.println(sdt);
     String title = sdt.getTitle();
     String content = sdt.getContent().getText();
     if (!(title == null) && !(title.isEmpty())) {
      System.out.println(title + ": " + content);
     } else {
      System.out.println("====sdt without title====");
      System.out.println(content);
     }
    }
   }
  }
...

Using curren apache poi 5.2.0 it is possible getting the XWPFSDTCell from XWPFTableRow via XWPFTableRow.getTableICells. This gets al List of ICells which is an interface which XWPFSDTCell also implemets.

So following code will get all XWPFSDTCell from tables without the need of low level XML parsing:

...
  for (XWPFTable table : document.getTables()) {
   for (XWPFTableRow row : table.getRows()) {
    for (ICell iCell : row.getTableICells()) {
     if (iCell instanceof XWPFSDTCell) {
      XWPFSDTCell sdt = (XWPFSDTCell)iCell;
//System.out.println(sdt);
      String title = sdt.getTitle();
      String content = sdt.getContent().getText();
      if (!(title == null) && !(title.isEmpty())) {
       System.out.println(title + ": " + content);
      } else {
       System.out.println("====sdt without title====");
       System.out.println(content);
      }
     }
    }
   }
  }
...

回复收藏 0 原文

~没有更多了~