How do I get the root of the tree without parsing the entire file?
I'm writing an XML parser to process XML reports from different tools; each tool generates a different report with different tags.
For example:
Arachni generates an XML report with <arachni_report></arachni_report> as the root tag.
nmap generates an XML report with <nmaprun></nmaprun> as the root tag.
I'm trying not to parse the entire file unless it's a valid report from one of the tools I want.
My first thought was to use ElementTree, parse the entire XML file (assuming it contains valid XML), and then check the tree root to see whether the report belongs to Arachni or nmap.
I'm currently using cElementTree, and as far as I know getroot() is not an option here; my goal is for this parser to operate on recognized files only, without parsing unnecessary files.
By the way, I'm still learning about XML parsing. Thanks in advance.
Comments (6)
"Simple string methods" are the root [pun intended] of all evil; see the examples below.
Update 2: Code and output now show that the proposed regexes also don't work very well.
Use ElementTree. The function that you are looking for is iterparse. Enable "start" events. Bail out on the first iteration.
Code:
The above ElementTree-related code works with Python 2.5 to 2.7. It will also work with Python 2.2 to 2.4; you just need to get ElementTree and cElementTree from effbot.org and do some conditional importing. It should work with any lxml version.
Output:
Update 1: The above was demonstration code. Below is more like implementation code... just add exception handling. Tested with Python 2.7 and 2.2.
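As a rough sketch of the iterparse approach with exception handling added (not the answerer's original listing): this assumes Python 3's xml.etree.ElementTree, and the function name get_root_tag is made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def get_root_tag(fileobj):
    """Return the tag of the root element, or None if no root can be read.

    fileobj is a binary file-like object. Only the first "start" event is
    consumed, so the rest of the document is never turned into a tree.
    Note: for a namespaced root the tag comes back as "{namespace}name".
    """
    try:
        for _event, elem in ET.iterparse(fileobj, events=("start",)):
            return elem.tag  # bail out on the first iteration
    except ET.ParseError:
        return None  # empty input, or not well-formed up to the root start-tag
    return None


# Small in-memory demonstration; a real report would be opened with open(path, "rb").
sample = io.BytesIO(b'<?xml version="1.0"?>\n<nmaprun scanner="nmap"><host/></nmaprun>')
print(get_root_tag(sample))  # nmaprun
```

Because the loop returns on the first event, a file that is broken further down is still reported by its root tag; the full parse that follows is where such errors would surface.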
My understanding of your problem is this: You want to examine a file to determine if it is one of your recognized formats, and only parse it as XML if you know that it is one of the recognized formats. @eyquem is right: you should use simple string methods.
The simplest thing to do is to read some small amount from the beginning of the file, and see if it has a root element you recognize:
This method has the advantage that only a small amount of the file is read before determining whether it's an interesting file or not.
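A minimal sketch of that idea, assuming the two root names from the question and a UTF-8 (or other ASCII-compatible) file; KNOWN_ROOTS, sniff_root and sniff_file are hypothetical names. It is a heuristic only: a comment or other text near the top of the file that happens to contain "<nmaprun " would fool it.

```python
KNOWN_ROOTS = ("arachni_report", "nmaprun")


def sniff_root(head):
    """Return the first known root name whose start-tag appears in head, else None."""
    for name in KNOWN_ROOTS:
        pos = head.find("<" + name)
        if pos == -1:
            continue
        # The name must be followed by whitespace, '/' or '>', so that
        # "<nmaprunner>" is not mistaken for "<nmaprun>".
        nxt = head[pos + 1 + len(name): pos + 2 + len(name)]
        if nxt and nxt in " \t\r\n/>":
            return name
    return None


def sniff_file(path, size=1024):
    """Read only the first `size` bytes of the file and sniff them."""
    with open(path, "rb") as f:
        head = f.read(size).decode("utf-8", errors="replace")
    return sniff_root(head)
```

Only the first kilobyte is read here; the size is an arbitrary choice and may need to grow if reports start with a long prolog.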
Would this seem interesting to a connoisseur of XML?
Result, still:
If necessary, one could add a check that the start-tag of the tree root also appears near the beginning of the file.
If the file is big, to speed things up, we can move the file pointer near the end of the file (say 200 or 600 characters before the end) and read and search only a string 200 or 600 characters long (the end-tag of the tree root of an XML document can't be longer than that, can it?).
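The seek-to-the-end idea could be sketched like this, assuming a binary file object and an ASCII-compatible encoding such as UTF-8 (it would not work on UTF-16 files); tail_has_end_tag and the 600-byte default are illustrative choices, not the poster's code.

```python
import io
import os


def tail_has_end_tag(fileobj, name, tail_size=600):
    """Return True if '</name>' occurs in the last tail_size bytes of fileobj."""
    fileobj.seek(0, os.SEEK_END)
    size = fileobj.tell()
    # Jump to tail_size bytes before the end (or to the start of a short file).
    fileobj.seek(max(0, size - tail_size))
    tail = fileobj.read()
    return ("</" + name + ">").encode("ascii") in tail


# In-memory stand-in for a large report file.
report = io.BytesIO(b"<nmaprun>" + b"<host/>" * 1000 + b"</nmaprun>\n")
print(tail_has_end_tag(report, "nmaprun"))
```

Only the tail is read, whatever the file size. This plain substring test looks for the exact text "</name>", so an end-tag with whitespace before the ">" would be missed.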
Are you serious, John Machin, when you show that my code wouldn't work correctly?
Since I don't know the XML format well, I went here:
W3C's XML validator
The conclusion is that your text samples are not well-formed. Hence:
Did you mean that I was supposed to have written code able to detect the tree root's tag in non-XML files? I didn't know I had this extra requirement to fulfill.
Here's some code that is a little less frightening than the one using only string methods. I didn't post it before because I would have received notifications that "..whisp...you MUST not employ regexes to analyse an XML text... whisp whisp".
The same could be done in the noisier beginning of the XML.
In fact, I prefer the solution given by John Machin, with the iterparse() function of ElementTree, and that's it!
EDIT
After all, I wonder why this wouldn't be enough....
Final edit:
Thanks to John Machin, I'll be using the following code (this is a draft), based on his answer (which is the one I selected as correct).
I'd also like to thank eyquem for his responses and his persistence in defending his code. I really learned a lot :)
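A draft along those lines might look like the sketch below. It is not the poster's actual code: the handler functions and the HANDLERS table are placeholders, and it assumes Python 3's xml.etree.ElementTree. The first "start" event identifies the report, and the rest of the file is parsed only if the root tag is recognized.

```python
import io
import xml.etree.ElementTree as ET


def parse_arachni(root):
    # Placeholder: a real handler would walk the Arachni report tree.
    return ("arachni", len(root))


def parse_nmap(root):
    # Placeholder: a real handler would walk the nmap report tree.
    return ("nmap", len(root))


HANDLERS = {"arachni_report": parse_arachni, "nmaprun": parse_nmap}


def process_report(fileobj):
    """Fully parse fileobj only if its root tag is recognized; else return None."""
    try:
        context = ET.iterparse(fileobj, events=("start",))
        _event, root = next(context)   # first event is the root start-tag
        handler = HANDLERS.get(root.tag)
        if handler is None:
            return None                # unrecognized report: stop reading here
        for _ in context:              # recognized: let the parser finish the tree
            pass
        return handler(root)
    except (ET.ParseError, StopIteration):
        return None
```

Keeping the tag-to-handler mapping in one dictionary means that supporting another tool only takes one more entry.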
Here is what you were looking for, John Machin: the sequel to our serial. I checked that this time my brain was in its correct place, and I continued to think about the problem.
So you have extended the demonstration code.
Now, with your several example texts, it is clear to me that string methods are far from sufficient, and I UNDERSTAND why. I am very interested in knowing what goes on underneath and in understanding the concrete reasons behind assertions.
So I studied the XML specification more than I ever had, and ran tests with the W3C's validator to improve my understanding of the details of the structure of an XML text. It's a rather austere occupation, but an interesting one. I saw that the XML format is a mix of very strict rules and debonair liberties.
From the tricks you used in your examples to tear my code to pieces, I conclude that the XML format doesn't require the text to be divided into lines. In fact, as the W3C's validator showed me, the characters \n, \r and \t can appear at many positions in an XML text, provided that they don't break a structural rule. For example, they are allowed without any restriction between tags; as a consequence, an element may occupy several lines. Even tags can be split across several lines, or among several tabs \t, provided that the whitespace occurs after the tag's name. There is no requirement for the lines of an XML text to be indented as I have always seen them; I understand now that it's only a convenience chosen for ease of reading and logical comprehension. Well, you know all that better than I do, John Machin. Thanks to you, I am now alert to the complexity of the XML format, and I better understand the reasons that make parsing unrealistic by any means other than specialized parsers. Incidentally, I wonder whether most coders are aware of this awkwardness of the XML format: the possibility of \n characters appearing here and there in an XML text.
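To make the whitespace point concrete, here is a small check using Python 3's xml.etree.ElementTree (the sample document is invented for illustration). Whitespace after a tag name is accepted, even inside an end-tag, while whitespace before the name is not.

```python
import xml.etree.ElementTree as ET

# Start-tag, empty-element tag and end-tag all split across lines and tabs.
doc = '<nmaprun\n\tscanner="nmap"\r\n>\n<host\n/>\n</nmaprun\n\t>\n'
root = ET.fromstring(doc)
print(root.tag, root.attrib, len(root))

# Whitespace *before* the tag name breaks well-formedness.
try:
    ET.fromstring("< nmaprun/>")
    space_before_name_ok = True
except ET.ParseError:
    space_before_name_ok = False
print(space_before_name_ok)
```

This is why a plain search for "</nmaprun>" can miss a perfectly legal end-tag.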
Anyway, as I have been in this conceptual boiling pot for a while now, I continued to search for a solution to your whac_moles, John Machin, as an instructive game.
String methods being out of the game, I polished my regex.
I know, I know: you'll tell me that analyzing an XML text can't be done even with a regex. Now that I know better why, I agree. But I don't claim to parse an XML text: my regex won't extract any part of an XML tree, it will only search a small chunk of text. For the problem posed by the OP, I consider the use of a regex non-heretical.
From the beginning, I have thought that searching for the end-tag of the root is easier and more natural, because an end-tag has no attributes and there is less "noise" around it than around the root's start-tag.
So my solution is now:
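As an illustration only (this is a sketch of the idea, not the regex from the original post), an end-tag search over the tail of a file could look like this. It assumes the tail has already been read and decoded to a string. The pattern allows whitespace before the closing ">" and skips any comments or processing instructions that follow the root end-tag; root_end_tag_name is a hypothetical name.

```python
import re

# '</' name, optional whitespace, '>', then only whitespace, comments or
# processing instructions until the end of the text.
END_TAG = re.compile(
    r"</([^\s<>/]+)\s*>"
    r"(?:\s|<!--(?:(?!--).)*-->|<\?(?:(?!\?>).)*\?>)*\Z",
    re.DOTALL,
)


def root_end_tag_name(tail):
    """Return the name in the last end-tag of tail, or None if there is none."""
    match = END_TAG.search(tail)
    return match.group(1) if match else None


print(root_end_tag_name("<host/></nmaprun\n\t>\n<!-- done -->\n"))  # nmaprun
```

It checks nothing about well-formedness; it only reports which end-tag closes the text it is given.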
The bigger the file, the faster this algorithm is compared to using parse or iterparse. I wrote the following code and examined all of its results. The first strategy is the faster one.
Result:
This code now measures the execution times:
Result:
Considering your unsophisticated need, Aereal, I think you don't care about the root's end-tag possibly containing the characters \r \n \t instead of its name alone. So the best solution for you is, in my opinion:
Thanks to the expertise of John Machin, this solution does a more reliable job than my previous one; in addition, it answers the request exactly as it was expressed: no parsing, hence a faster method, which was the implicit aim.
John Machin, will you find a new tricky feature of the XML format that will invalidate this solution?