How do I parse a plain text file containing occasional XML tags using Java and SAX?
I have a rather large log file from a server which contains plain text. The server logs everything it does, and occasionally it prints XML tags which I am interested in parsing. To give you an example:
-----------log file-------------
bla bla bla random text
<logMessage>test Message</logMessage>
some more random server output
<logMessage>some other message</logMessage>
bla bla bla
end of log file
I just want to extract the data from the <logMessage> tags and ignore the rest. I am using Java and SAX, but the SAX parser expects the content of the file to be strictly XML and it cannot handle this type of file. Is there a way to tell SAX to ignore/overlook the fact that the file is not well-formed XML?
What's the alternative? Read the file line by line and look for the tags? :(
Answers (2)
For simplicity's sake I would opt for reading the file line by line and looking for <logMessage> and </logMessage> tokens. Note that you can make a generic parser of that kind which takes a delegate parser and feeds it SAX-like events. (This may be useful depending on how much work it would otherwise be to rewrite your parsers, now that your SAX-based solution turns out not to work.)

EDIT: The delegate approach is also useful if you are interested in more than one kind of element. If these happen to have complex (embedded) XML hierarchies, you could even collate all the characters between the opening and closing tokens into a buffer, then feed that buffer to a real SAX parser. This would be overkill in most cases, but again, if your logs essentially contain XML dumps, it might be more suitable than trying to parse it all yourself.
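A minimal sketch of the line-by-line approach, assuming (as in the example log) that each <logMessage> element opens and closes on the same line; the class and method names here are just illustrative:

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogMessageExtractor {

    // Matches a <logMessage>...</logMessage> pair within a single line.
    // The non-greedy (.*?) captures the text between the tags.
    private static final Pattern LOG_MESSAGE =
            Pattern.compile("<logMessage>(.*?)</logMessage>");

    // Reads the log line by line and collects the text of every logMessage tag.
    public static List<String> extract(BufferedReader reader) throws Exception {
        List<String> messages = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            Matcher m = LOG_MESSAGE.matcher(line);
            while (m.find()) {
                messages.add(m.group(1));
            }
        }
        return messages;
    }

    public static void main(String[] args) throws Exception {
        String log = "bla bla bla random text\n"
                + "<logMessage>test Message</logMessage>\n"
                + "some more random server output\n"
                + "<logMessage>some other message</logMessage>\n"
                + "bla bla bla\n";
        System.out.println(extract(new BufferedReader(new StringReader(log))));
        // prints [test Message, some other message]
    }
}
```

If a message could span multiple lines, you would instead buffer everything between the opening and closing tokens before emitting it, as described above.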
I don't think straight XML parsing is appropriate for this sort of file. If every XML snippet is contained in a single line (opening and closing tags on the same line), then reading the file line by line, checking for the presence of tags, and skipping non-XML lines would be the simplest way to do it. After skipping the non-XML lines you could pass the remaining stream to a SAX parser, or just apply a regexp on a line-by-line basis.

Essentially, the above approach is identical to grepping the file first to leave only the XML lines, then wrapping them in a root element to make well-formed XML and parsing that.