How do I extract all PCDATA (text) from an XML file in Java?

Posted on 2024-11-07 04:14:02

I have a bunch of XML files, along with the DTD, that each have a <TEXT> section. The DTD for the TEXT element looks like this:

Here is what an example XML file would look like:

<ROOT>
  ...
  <TEXT>
  Some text that I want to extract
  <SUMMARY> Some more text </SUMMARY>
  <AGENCY> 
     An agency
     <SIGNER> Bob Smith </SIGNER>
  </AGENCY>
  </TEXT>
  ...
</ROOT>

In the end, I want to extract

Some text that I want to extract
Some more text
An agency
Bob Smith

However, each <TEXT> block obviously is not the same in terms of the elements / ordering, or how far down you go. Is there a way in Java using DOM that I can do this? I'd prefer to use DOM over SAX, but if it's much easier to use SAX, then so be it.

Thanks in advance


如梦初醒的夏天 2024-11-14 04:14:03

An XSLT stylesheet would work:

UPDATE #2: I doubt this would work for you since you're actually using SGML and not XML. The give-away is that the element declaration you have in your question has tag minimization which is not allowed in XML.

UPDATE: Modified the XML input and XSLT to only display the text in the <TEXT> structure.

XML INPUT

<ROOT>
  <IGNORE>ignore this data</IGNORE>
  <TEXT>
    Some text that I want to extract
    <SUMMARY> Some more text </SUMMARY>
    <AGENCY> 
      An agency
      <SIGNER> Bob Smith </SIGNER>
    </AGENCY>
  </TEXT>
  <IGNORE>ignore this data</IGNORE>
</ROOT>

XSLT

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <xsl:value-of select="normalize-space(/ROOT/TEXT)"/>
  </xsl:template>

</xsl:stylesheet>

OUTPUT

Some text that I want to extract Some more text An agency Bob Smith

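Note: This XSLT only works if TEXT is a child of ROOT. If TEXT might be nested more deeply, you can change the "select" to select="normalize-space(//TEXT)". In XSLT 1.0, normalize-space() converts a node-set to the string value of its first node in document order, so that expression returns the text of the first TEXT element only.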
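Since the question is about Java, here is a minimal sketch of how the stylesheet above could be applied with the JDK's built-in javax.xml.transform API. The class name XsltTextDemo and the method extractText are made up for this illustration, and the stylesheet and sample input are inlined as strings only to keep the example self-contained; real files would be passed as StreamSource(File) instead.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class; not part of the original answer.
class XsltTextDemo {

    // The stylesheet from the answer, inlined as a string.
    static final String XSLT =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"text\"/>"
      + "<xsl:template match=\"/\">"
      + "<xsl:value-of select=\"normalize-space(/ROOT/TEXT)\"/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // The sample input from the answer.
    static final String XML =
        "<ROOT>\n"
      + "  <IGNORE>ignore this data</IGNORE>\n"
      + "  <TEXT>\n"
      + "    Some text that I want to extract\n"
      + "    <SUMMARY> Some more text </SUMMARY>\n"
      + "    <AGENCY> \n"
      + "      An agency\n"
      + "      <SIGNER> Bob Smith </SIGNER>\n"
      + "    </AGENCY>\n"
      + "  </TEXT>\n"
      + "  <IGNORE>ignore this data</IGNORE>\n"
      + "</ROOT>";

    // Runs the stylesheet over the given XML and returns the text output.
    static String extractText(String xml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        System.out.println(extractText(XML));
    }
}
```

If you would rather stay in DOM, as the question asks, calling getTextContent() on the TEXT element returns the same concatenated descendant text, just without the whitespace normalization.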


月下凄凉 2024-11-14 04:14:03

I'm not a big fan of SAX, but for this, I think it would work nicely.

Just define a SAX handler, but only override the characters method. Then just throw the received characters in a StringBuilder and you're done.

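import org.xml.sax.helpers.DefaultHandler;

public class TextExtractor extends DefaultHandler {

  // Accumulates every chunk of character data the parser reports.
  private final StringBuilder sb = new StringBuilder();

  @Override
  public void characters(char[] ch, int start, int length) {
    // The text occupies ch[start .. start + length - 1], so the offset matters.
    sb.append(ch, start, length);
  }

  public String getText() {
    return sb.toString();
  }

}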
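A handler like the one above collects character data from the whole document, including anything outside <TEXT>. Below is a rough sketch of one way to restrict it to the <TEXT> section and drive it with the JDK's SAXParser. The names TextSectionDemo, TextSectionHandler and extractText are hypothetical, and the depth counter and final whitespace collapsing are my own additions rather than part of the answer.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the original answer.
class TextSectionDemo {

    // Collects character data only while the parser is inside a <TEXT> element.
    static class TextSectionHandler extends DefaultHandler {
        private final StringBuilder sb = new StringBuilder();
        private int depth = 0; // greater than 0 while inside <TEXT>

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // Enter on <TEXT>, and keep counting nested elements once inside.
            if (depth > 0 || "TEXT".equals(qName)) {
                depth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (depth > 0) {
                depth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (depth > 0) {
                sb.append(ch, start, length);
            }
        }

        // Returns the collected text with runs of whitespace collapsed to one space.
        String getText() {
            return sb.toString().trim().replaceAll("\\s+", " ");
        }
    }

    // Parses the XML string and returns the text found inside <TEXT>.
    static String extractText(String xml) throws Exception {
        TextSectionHandler handler = new TextSectionHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.getText();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<ROOT><IGNORE>ignore this data</IGNORE>"
            + "<TEXT> Some text that I want to extract"
            + "<SUMMARY> Some more text </SUMMARY>"
            + "<AGENCY> An agency<SIGNER> Bob Smith </SIGNER></AGENCY>"
            + "</TEXT><IGNORE>ignore this data</IGNORE></ROOT>";
        System.out.println(extractText(xml));
    }
}
```

The counter is used instead of a boolean flag so that child elements such as <SUMMARY> and <AGENCY> do not end collection early when they close.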
