Getting XML entity replacement text from the DOM in Xerces

Posted on 2024-11-09 15:38:01

The Javadoc for org.w3c.dom.Entity states:

XML does not mandate that a non-validating XML processor read and process entity declarations made in the external subset or declared in parameter entities. This means that parsed entities declared in the external subset need not be expanded by some classes of applications, and that the replacement text of the entity may not be available. When the replacement text is available, the corresponding Entity node's child list represents the structure of that replacement value. Otherwise, the child list is empty.

Whilst it does not refer to entity declarations made in the internal subset, there must surely be some configuration of parser which will read and process entity declarations in either subset? Indeed, my reading of the documentation would suggest that this is the default.

In any event, I have tested the following approach (using Xerces) against entities which have been declared in the internal subset (as shown) and also in an external subset, but foo.hasChildNodes() returns false (and foo.getChildNodes() returns foo!) in every case:

// some trivial example XML
String xml = "<!DOCTYPE example [ <!ENTITY foo 'bar'> ]>\n<example/>";
InputStream is = new ByteArrayInputStream(xml.getBytes());

// parse
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
DocumentType docType = builder.parse(is).getDoctype();

// retrieve the entity - works fine
Entity foo = (Entity) docType.getEntities().getNamedItem("foo");

// now how to get the entity's replacement text?

No doubt I am missing something rather obvious; grateful for your thoughts.

EDIT

It appears from the answers so far that my Xerces implementation is misbehaving. I will try to update all Xerces libraries to latest versions and, if that solves my problem, I will close off the question. Many thanks.

UPDATE

Updating Xerces has indeed solved the problem, provided that the entity is referenced from within the document; if it is not, then the node still has no children. It is not entirely clear to me why this should be the case. Grateful if someone could explain what's going on and/or point me to how I can force the creation of the child nodes without explicitly referencing every entity from within the document.

悟红尘 2024-11-16 15:38:01

I think you may be mistaken about how the replacement text works. Based on some reading (http://www.javacommerce.com/displaypage.jsp?name=entities.sql&id=18238), it looks to me like the replacement text works like a variable. In your example above you never reference the &foo; entity. If you run the code sample below, you will see that &foo; gets replaced with the string bar:

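// imports needed by the snippet below
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;
import org.w3c.dom.Entity;
import org.w3c.dom.NodeList;

// some trivial example XML that references &foo; in its content
String xml = "<!DOCTYPE example [ <!ENTITY foo 'bar'> ]><example><foo>&foo;</foo></example>";
InputStream is = new ByteArrayInputStream(xml.getBytes());

// parse
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(is);
DocumentType docType = doc.getDoctype();

// retrieve the entity and print its child nodes (the replacement text)
Entity foo = (Entity) docType.getEntities().getNamedItem("foo");
NodeList children = foo.getChildNodes();
for (int i = 0; i < children.getLength(); i++) {
  System.out.println(children.item(i));
}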

What you see printed is [#text: bar] which is the text replacement within the XML.
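
The variable analogy can also be checked from the document side. The following is a minimal sketch, assuming the JAXP parser bundled with the JDK and its default settings; the class name EntityExpansionDemo and the method expandedFooText are made up for illustration. It parses the same XML and reads the text of the <foo> element, where the reference has already been replaced by the entity's value.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

final class EntityExpansionDemo {

    // Parses the sample XML and returns the text content of the <foo> element.
    static String expandedFooText() throws Exception {
        String xml = "<!DOCTYPE example [ <!ENTITY foo 'bar'> ]>"
                + "<example><foo>&foo;</foo></example>";
        InputStream is = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));

        // Default factory settings: entity references in content are expanded.
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(is);

        Node fooElement = doc.getElementsByTagName("foo").item(0);
        return fooElement.getTextContent();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(expandedFooText());
    }
}
```

With default settings the element text should come back as the entity's value rather than the literal reference.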

云醉月微眠 2024-11-16 15:38:01

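I may be wrong, but I believe the Entity node stores the replacement text as its text value rather than as a set of nodes. This is because an entity is not fully parsed at the point where its definition is parsed, mainly because the DTD handler is a preprocessor that runs before the actual parsing process.
So check the text value of the Entity node rather than its list of child nodes.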

只有一腔孤勇 2024-11-16 15:38:01

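I don't know why foo.getChildNodes() doesn't work, but I found out the following. If the entity is used (referenced) in the document, for example

<!DOCTYPE example [ <!ENTITY foo 'bar'> ]>\n<example>&foo;</example>

then the replacement text can be obtained with foo.getTextContent().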
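
The difference between a referenced and an unreferenced entity can be compared side by side. This is a minimal sketch, assuming the JDK's bundled Xerces-derived parser with default settings; the class EntityTextDemo and the helper entityText are hypothetical names, and returning null for an Entity node without children is a choice made here for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Entity;
import org.xml.sax.InputSource;

final class EntityTextDemo {

    static final String DOCTYPE = "<!DOCTYPE example [ <!ENTITY foo 'bar'> ]>\n";

    // Returns the text content of the "foo" Entity node,
    // or null when the node has no children (no replacement text available).
    static String entityText(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        Entity foo = (Entity) doc.getDoctype().getEntities().getNamedItem("foo");
        return foo.hasChildNodes() ? foo.getTextContent() : null;
    }

    public static void main(String[] args) throws Exception {
        // Entity declared but never referenced in the document
        System.out.println(entityText(DOCTYPE + "<example/>"));
        // Entity declared and referenced in the document content
        System.out.println(entityText(DOCTYPE + "<example>&foo;</example>"));
    }
}
```

The first call corresponds to the question's original document and the second to the referenced case described above.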

转角预定愛 2024-11-16 15:38:01

I asked on the Xerces-J Users mailing list about the non-existence of child nodes where the entities are not referenced within the document; there Michael Glavassevich helpfully pointed me towards an old post from Andy Clark explaining as follows:

Unfortunately (for you) this is a feature. And it was implemented
this way mainly for performance reasons. If an entity is never
referenced in the document, then we never have to waste time
reading it. If the external entity is huge but never referenced,
we don't waste time or memory.

Plus, there is a deeper problem in relation to namespaces. DOM
can't even help. I'll explain...

Take the following document and external entity:

  entity.ent:
  <hello/>

  document:
  <!DOCTYPE root [
  <!ENTITY entity SYSTEM 'entity.ent'>
  ]>
  <root>
    <sub xmlns='foo'> &entity; </sub>
    <sub xmlns='bar'> &entity; </sub>
  </root>

Notice that the default namespace is different at each point
where the entity is referenced. This means that the <hello/>
element will be bound to different namespaces. So both
instances of the same entity are actually different elements!

In this situation, what should the Entity node in the DOM doctype
return: children in the "foo" namespace or children in the "bar"
namespace?

In short, it's a complicated issue.

You might be best off trying to read the document fragment
yourself when you look for the Entity node and it has no
children. Xerces has a document fragment scanner in the impl
package that would be useful for this purpose. You'd have to
write code that builds children for a DOM document fragment
from XNI methods, though. But this isn't hard to do. I can
point you to an example if you need it.
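
The namespace point in the quoted post can be illustrated with a namespace-aware parse. This is a minimal sketch, assuming the JDK's bundled parser with default settings and that loading external entities is permitted; the class EntityNamespaceDemo and the method helloNamespaces are hypothetical names, and the EntityResolver that serves entity.ent from memory stands in for a real file. It is not the XNI-based approach the post suggests.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class EntityNamespaceDemo {

    // The document from the quoted post.
    static final String DOCUMENT =
            "<!DOCTYPE root [\n"
            + "<!ENTITY entity SYSTEM 'entity.ent'>\n"
            + "]>\n"
            + "<root>\n"
            + "  <sub xmlns='foo'> &entity; </sub>\n"
            + "  <sub xmlns='bar'> &entity; </sub>\n"
            + "</root>";

    // Returns the namespace URI of each <hello/> element, in document order.
    static List<String> helloNamespaces() throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();

        // Assumption: every external entity request is answered with "<hello/>",
        // standing in for the contents of entity.ent.
        builder.setEntityResolver((publicId, systemId) -> {
            InputSource source = new InputSource(new StringReader("<hello/>"));
            source.setSystemId(systemId);
            return source;
        });

        Document doc = builder.parse(new InputSource(new StringReader(DOCUMENT)));
        NodeList hellos = doc.getElementsByTagNameNS("*", "hello");

        List<String> namespaces = new ArrayList<>();
        for (int i = 0; i < hellos.getLength(); i++) {
            namespaces.add(hellos.item(i).getNamespaceURI());
        }
        return namespaces;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(helloNamespaces());
    }
}
```

Each expansion of the same entity picks up the default namespace in scope at its point of reference, which is why a single set of children on the Entity node cannot represent both.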