Java - Reading XML while leaving all entities untouched

Posted on 2024-12-03 20:44:00

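I want to read XHTML files using SAX or StAX, whichever works best. But I don't want entities to be resolved, replaced, or otherwise touched. Ideally they should just remain as they are. I don't want to use DTDs.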

Here's an executable example (using Scala 2.8.x):

import javax.xml.stream._
import javax.xml.stream.events._
import java.io._

println("StAX Test - "+args(0)+"\n")
val factory = XMLInputFactory.newInstance
factory.setProperty(XMLInputFactory.SUPPORT_DTD, false)
factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, false)

println("------")
val xer = factory.createXMLEventReader(new FileReader(args(0)))
val entities = new collection.mutable.ArrayBuffer[String]
while (xer.hasNext) {
    val event = xer.nextEvent
    if (event.isCharacters) {
        print(event.asCharacters.getData)
    } else if (event.getEventType == XMLStreamConstants.ENTITY_REFERENCE) {
        entities += event.asInstanceOf[EntityReference].getName
    }
}
println("------")
println("Entities: " + entities.mkString(", "))

Given the following XHTML file:

<html>
    <head>
        <title>StAX Test</title>
    </head>
    <body>
        <h1>Hallo StAX</h1>
        <p id="html">
            &lt;div class=&quot;header&quot;&gt;
        </p>
        <p id="stuff">
            &Uuml;berdies sollte das hier auch als Copyright sichtbar sein: &#169;
        </p>
        Das war's!
    </body>
</html>

Running scala stax-test.scala stax-test.xhtml results in:

StAX Test - stax-test.xhtml

------


    StAX Test


    Hallo StAX

      <div class="header">


      berdies sollte das hier auch als Copyright sichtbar sein: ?

    Das war's!

------
Entities: Uuml

So all entities have been replaced, more or less successfully. What I expected, and what I want, is this:

StAX Test - stax-test.xhtml

------


    StAX Test


    Hallo StAX

      &lt;div class=&quot;header&quot;&gt;


      &Uuml;berdies sollte das hier auch als Copyright sichtbar sein: &#169;

    Das war's!

------
Entities: // well, or no entities above and instead:
// Entities: lt, quot, quot, gt, Uuml, #169

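Is this even possible? I want to parse XHTML, make some modifications, and then write it out as XHTML again, so I really want the entities to remain in the result.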

Also, I don't understand why Uuml is reported as an EntityReference event while the rest aren't.

Comments (3)

落在眉间の轻吻 2024-12-10 20:44:00

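A bit of terminology: &#x169; is a numeric character reference (not an entity), and &auml; is an entity reference (not an entity).

I don't think any XML parser reports numeric character references to the application; they are always expanded. Really, your application shouldn't care about this any more than it cares about how much whitespace there is between attributes.

As for entity references, low-level parsing interfaces such as SAX do report their existence, at least when they occur in element content (not in attribute content). These are special events delivered only to the LexicalHandler, not to the ContentHandler.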
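
A minimal Java sketch of the LexicalHandler route described above. The class and method names are illustrative, not from the question. The sample document declares Uuml in an internal DTD subset, which is an assumption made for the demo: in a document with no DTD at all, a reference to an undeclared entity is a well-formedness error, so the parser would fail before reporting anything.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Illustrative class name; not part of the original question.
class EntityRefListing {
    // Returns the entity names passed to LexicalHandler.startEntity while parsing xml.
    static List<String> reportedEntities(String xml) throws Exception {
        final List<String> names = new ArrayList<>();
        // DefaultHandler2 implements both ContentHandler and LexicalHandler.
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void startEntity(String name) {
                names.add(name);
            }
        };
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        // LexicalHandler is an optional SAX2 extension, registered through a property.
        reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: Uuml is declared in an internal subset so the document is well-formed.
        String xml = "<!DOCTYPE p [<!ENTITY Uuml \"&#220;\">]>"
                   + "<p>&Uuml;berdies &lt;div&gt; &#169;</p>";
        System.out.println("Entities: " + reportedEntities(xml));
    }
}
```

The handler only learns that an entity reference started; the replacement text still arrives through characters(), so an application that wants to write the reference back out would have to suppress the text between startEntity and endEntity itself. Character references such as &#169; are not expected to show up in this list, in line with the answer above.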

难得心□动 2024-12-10 20:44:00

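The answer to "why Uuml is reported as an EntityReference event while the rest aren't" is that the rest are defined by the XML specification, while &Uuml; is specific to HTML 4.0.

Since your goal is to write out modified XHTML, it may be possible to force the serializer to emit numeric character references by setting the "encoding" to "US-ASCII" and/or the "method" to "html". The XSLT specification (which underlies the Java XML serializers) says that the serializer "may output a character using a character entity reference" when the method is html. Setting the encoding to ASCII may force it to use numeric references if named entities aren't supported.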
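
A minimal Java sketch of that serializer setup, using an identity transform from javax.xml.transform. The class and method names are illustrative, and the input string is a stand-in for the already-parsed document (the characters are written as Java escapes for Ü and ©).

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Illustrative class name; not part of the original answer.
class AsciiSerializer {
    // Copies xml through an identity transform and serializes it as US-ASCII
    // with the given output method ("xml" or "html").
    static String serialize(String xml, String method) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.METHOD, method);
        t.setOutputProperty(OutputKeys.ENCODING, "US-ASCII");
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return new String(out.toByteArray(), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<p>\u00DCberdies \u00A9</p>";
        System.out.println(serialize(xml, "xml"));
        System.out.println(serialize(xml, "html"));
    }
}
```

With the "xml" method, characters that US-ASCII cannot represent have to be written as numeric character references. Whether the "html" method produces named references such as &Uuml; depends on the serializer, so it is worth checking the actual output. Either way, this restores references on output; it does not remember how each character was spelled in the input.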

诗化ㄋ丶相逢 2024-12-10 20:44:00

In Java I would use a regular expression.

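import java.io.*;
import java.util.*;
import java.util.regex.*;

// Class name is illustrative; the original answer showed only the main method.
public class EntityScan {
  public static void main(String... args) throws IOException {
    // Matches entity references (&name;) as well as character references (&#169;).
    Pattern entity = Pattern.compile("&([^;]+);");
    // LinkedHashSet keeps first-seen order and drops repeats such as the second quot.
    Set<String> entities = new LinkedHashSet<String>();
    try (BufferedReader buf = new BufferedReader(new FileReader(args[0]))) {
      for (String line; (line = buf.readLine()) != null; ) {
        Matcher m = entity.matcher(line);
        while (m.find())
          entities.add(m.group(1));
      }
    }
    System.out.println("Entities: " + entities);
  }
}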

prints

Entities: [lt, quot, gt, Uuml, #169]
