How do I extract all PCDATA (text) from an XML file in Java?
I have a bunch of XML files, along with the DTD, that each have a <TEXT> section. The DTD for the TEXT element looks like this:
<!ELEMENT TEXT - - (AGENCY* | ACTION* | SUMMARY* | DATE* | FOOTNAME* | FURTHER* | SIGNER* | SIGNJOB* | FRFILING* | BILLING* | FOOTNOTE* | FOOTCITE* | TABLE* | ADDRESS* | IMPORT* | #PCDATA)+ >
Here is what an example XML file would look like:
<ROOT>
...
<TEXT>
Some text that I want to extract
<SUMMARY> Some more text </SUMMARY>
<AGENCY>
An agency
<SIGNER> Bob Smith </SIGNER>
</AGENCY>
</TEXT>
...
</ROOT>
In the end, I want to extract
Some text that I want to extract
Some more text
An agency
Bob Smith
However, each <TEXT> block obviously is not the same in terms of the elements, their ordering, or how deeply they nest. Is there a way in Java using DOM that I can do this? I'd prefer to use DOM over SAX, but if it's much easier to use SAX, then so be it.
Thanks in advance
An XSLT stylesheet would work:
UPDATE #2: I doubt this would work for you, since you're actually using SGML and not XML. The giveaway is that the element declaration in your question uses tag minimization (the "- -" after the element name), which is not allowed in XML.
UPDATE: Modified the XML input and XSLT to only display the text in the <TEXT> structure.
XML INPUT
XSLT
OUTPUT
Note: This XSLT only works if TEXT is a child of ROOT. If TEXT might be nested more deeply, you can change the "select" to: select="normalize-space(//TEXT)"
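Since the question asks about Java, here is a rough sketch of running such a stylesheet through the JDK's built-in javax.xml.transform API. The stylesheet below is not the one from this answer (that listing is not reproduced here); it is a minimal stand-in built around the select expression quoted above, and the class name TextViaXslt and method extract are made up for illustration. It assumes the input is well-formed XML, so SGML input would have to be converted first. Note also that in XSLT 1.0, normalize-space(//TEXT) uses only the first TEXT element in the document.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class TextViaXslt {
    // Hypothetical minimal stylesheet: emits the whitespace-normalized
    // string value of the first TEXT element as plain text.
    static final String XSLT =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"text\"/>"
      + "<xsl:template match=\"/\">"
      + "<xsl:value-of select=\"normalize-space(//TEXT)\"/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to an XML string and returns the text output.
    static String extract(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xml)),
                new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml =
            "<ROOT><TEXT>\n"
          + "Some text that I want to extract\n"
          + "<SUMMARY> Some more text </SUMMARY>\n"
          + "<AGENCY>\nAn agency\n<SIGNER> Bob Smith </SIGNER>\n</AGENCY>\n"
          + "</TEXT></ROOT>";
        // prints: Some text that I want to extract Some more text An agency Bob Smith
        System.out.println(extract(xml));
    }
}
```

To read from files instead of strings, StreamSource also accepts a java.io.File.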
I'm not a big fan of SAX, but for this, I think it would work nicely.
Just define a SAX handler, but only use the characters method. Then just throw the received characters in a StringBuilder and you're done.
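A minimal sketch of that idea, using the JDK's SAX parser. The names TextViaSax, TextCollector and extract are invented for the example. As written it collects character data from the whole document, not only from inside <TEXT>; restricting it to <TEXT> would mean also overriding startElement and endElement to track when the parser is inside that element. Collapsing whitespace at the end is an extra step assumed here so the result matches the one-line output the question asks for.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class TextViaSax {
    // Handler that ignores element events and only accumulates text.
    static class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times per text node, so always append.
            text.append(ch, start, length);
        }

        // Returns the collected text with runs of whitespace collapsed.
        String collapsed() {
            return text.toString().trim().replaceAll("\\s+", " ");
        }
    }

    // Parses an XML string and returns all of its character data.
    static String extract(String xml) {
        try {
            TextCollector handler = new TextCollector();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.collapsed();
        } catch (Exception e) {
            throw new IllegalStateException("SAX parsing failed", e);
        }
    }

    public static void main(String[] args) {
        String xml =
            "<ROOT><TEXT>\n"
          + "Some text that I want to extract\n"
          + "<SUMMARY> Some more text </SUMMARY>\n"
          + "<AGENCY>\nAn agency\n<SIGNER> Bob Smith </SIGNER>\n</AGENCY>\n"
          + "</TEXT></ROOT>";
        // prints: Some text that I want to extract Some more text An agency Bob Smith
        System.out.println(extract(xml));
    }
}
```

If DOM is preferred, as in the question, calling getTextContent() on the TEXT element node returns the concatenated text of all its descendants, which gives a similar result without a handler.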