Parsing an XML file while keeping line number information
I am creating a tool that analyzes some XML files (XHTML files, to be precise). The purpose of this tool is not only to validate the XML structure, but also to check the values of some attributes.
So I created my own org.xml.sax.helpers.DefaultHandler to handle events during XML parsing. One of my requirements is to know the current line number, so I decided to add an org.xml.sax.helpers.LocatorImpl to my own DefaultHandler. This solves almost all my problems, except one regarding XML attributes.
Let's take an example:
<rootNode>
<foo att1="val1"/>
<bar att2="val2"
answerToEverything="43"
att3="val3"/>
</rootNode>
One of my rules states that if the attribute answerToEverything is defined on the node bar, its value must be 42.
When encountering such XML, my tool should detect an error. I want to give the user a precise error message, such as:
Error in file "foo.xhtml", line #4: answerToEverything only allows "42" as value.
To do that, my parser must be able to keep track of the line number during parsing, even for attributes. Consider the following implementation in my own DefaultHandler class:
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
    // locator is the Locator field of the handler; it reports the parser's current position.
    String position = " at " + locator.getLineNumber() + ":" + locator.getColumnNumber();
    System.out.println("Start element <" + qName + ">" + position);
    for (int i = 0; i < attributes.getLength(); i++) {
        System.out.println("Att '" + attributes.getQName(i) + "' = '" + attributes.getValue(i) + "'" + position);
    }
}
then for the node <bar>, it will display the following output:
Start element <bar> at 5:23
Att 'att2' = 'val2' at 5:23
Att 'answerToEverything' = '43' at 5:23
Att 'att3' = 'val3' at 5:23
As you can see, the line number is wrong for the attributes, because the parser treats the whole start tag, including its attributes, as one block and reports only the position where that block ends.
Ideally, if the interface ContentHandler had defined startAttribute and startElementBeforeReadingAttributes methods, I wouldn't have any problem here :o)
So my question is: how can I solve this problem?
For information, I am using Java 6.
PS: Maybe another title for this question could be "Java SAX parsing with attribute parsing events", or something like that...
Comments (3)
I think the only way to implement this is to create your own InputStream (or Reader) that counts lines and somehow communicates with your SAX handler. I have not tried to implement this myself, but I believe it is possible. Good luck, and I would be glad if you manage to do this and post your results here.
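A related idea that avoids a custom Reader is to keep the source text next to the handler and re-scan it: the Locator gives the line where the start tag ends, so the handler can walk back to the line where the tag opens and look for each attribute name in between. The following is only a rough sketch under several assumptions: the whole document fits in memory, the parser supplies a Locator, and a simple regular expression is good enough to spot `name=` (it can be fooled by attribute values that contain similar text, or by several tags with the same name on one line). The class name AttributeLineFinder, the locate method, and the "element@attribute" key format are made up for illustration.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of SAX: guesses attribute lines by re-scanning the source text.
class AttributeLineFinder extends DefaultHandler {
    private final String[] lines;   // the document, one entry per line
    private Locator locator;        // supplied by the parser before parsing starts
    final Map<String, Integer> found = new LinkedHashMap<String, Integer>();

    AttributeLineFinder(String xml) {
        this.lines = xml.split("\r?\n", -1);
    }

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    // Text of a line, cut at the end of the start tag when it is the tag's last line.
    private String tagText(int line, int endLine, int endCol) {
        String text = lines[line - 1];
        if (line == endLine && endCol > 0) {
            text = text.substring(0, Math.min(endCol - 1, text.length()));
        }
        return text;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        int endLine = locator.getLineNumber();   // 1-based line where the start tag ends
        int endCol = locator.getColumnNumber();  // 1-based column just past the tag, or -1
        if (endLine < 1 || endLine > lines.length) {
            return;                              // no usable position reported
        }
        // Walk backwards to the line on which "<qName" opens.
        int startLine = endLine;
        while (startLine > 1 && tagText(startLine, endLine, endCol).lastIndexOf("<" + qName) < 0) {
            startLine--;
        }
        for (int i = 0; i < atts.getLength(); i++) {
            String name = atts.getQName(i);
            Pattern p = Pattern.compile("(^|\\s)" + Pattern.quote(name) + "\\s*=");
            for (int line = startLine; line <= endLine; line++) {
                String text = tagText(line, endLine, endCol);
                if (line == startLine) {
                    int open = text.lastIndexOf("<" + qName);
                    if (open >= 0) {
                        text = text.substring(open);  // ignore anything before the tag
                    }
                }
                if (p.matcher(text).find()) {
                    found.put(qName + "@" + name, line);  // first matching line wins
                    break;
                }
            }
        }
    }

    // Parses the given XML text and returns "element@attribute" -> guessed line number.
    static Map<String, Integer> locate(String xml) throws Exception {
        AttributeLineFinder handler = new AttributeLineFinder(xml);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.found;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<rootNode>\n"
                + "  <foo att1=\"val1\"/>\n"
                + "  <bar att2=\"val2\"\n"
                + "       answerToEverything=\"43\"\n"
                + "       att3=\"val3\"/>\n"
                + "</rootNode>";
        System.out.println(locate(xml));
    }
}
```

For the sample document in the question, the intent is that bar@answerToEverything is reported on line 4, which is the line the error message needs. The lookup is a heuristic, so it is worth falling back to the Locator's own line number whenever no match is found.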
Look for an open source XML editor; its parser might have this information.
Editors don't use the same kind of parser as an application that just uses XML for data. Editors need more information: line numbers, as you say, and I would think also information about whitespace characters. A parser for an editor should not lose any information about the characters in the file. That is how you can implement, for example, a format function or "select enclosing element" (Alt-Shift-Up in Eclipse).
In both XmlBeans and JAXB it is possible to preserve line number information. You could consider using one of these tools (it is easier in XmlBeans).