XML processing instructions and whitespace
I'm currently working on an XML/HTML parser for Node.js (if you're interested: link). Let me get right to the point: I need to know how I should handle leading whitespace inside processing instructions. Should these be treated as equivalent?
<?asdf ?>
< ?asdf ?>
<? asdf ?>
< ? asdf ?>
I guess that strict XML only allows the first one, but what's the expected behavior for the others? I don't want to validate; I want to accept as many constructs as I can. It's more of a philosophical question.
Thanks in advance!
Comments (1)
According to the XML specification, only the first representation is allowed: the opening <? is a single delimiter, and the target name must follow it immediately. I'd say the other representations should result in an error.
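To make the strict rule concrete, here is a rough sketch of a check for the well-formed shape. The function name isStrictPi is made up for illustration, and the regex is a simplification rather than the full grammar from the spec: it only accepts ASCII target names and does not special-case the reserved xml target.

```javascript
// Rough approximation of the strict PI shape: "<?" + target name,
// then optionally whitespace plus content, then "?>".
// Assumption: ASCII-only names; the real Name production allows much more.
const STRICT_PI = /^<\?[A-Za-z_:][\w.:-]*(?:\s[\s\S]*?)?\?>$/;

// Hypothetical helper: true only if the string looks like a strict PI.
function isStrictPi(text) {
  return STRICT_PI.test(text);
}

// The four variants from the question.
const variants = ['<?asdf ?>', '< ?asdf ?>', '<? asdf ?>', '< ? asdf ?>'];
for (const v of variants) {
  console.log(JSON.stringify(v), isStrictPi(v));
}
```

Only the first variant passes; the other three fail because something sits between < and ?, or between <? and the target name.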
You could add some pre-processing to clean up the invalid constructs (remove the whitespace) and then read the data as XML.
This pre-processor would clean your data before it reaches your XML parser; it could even be a separate program. That way, as long as the input is halfway valid, your XML parser only ever sees valid XML and has fewer special cases to handle. If the parser still encounters an error, you can assume the input was not XML-ish at all.
So, for example, the data would be altered in stages during pre-processing and finally parsed as XML:
Remove bogus whitespace (one preprocessor) → Guess closing tags (another preprocessor) → Parse as XML
The question of which constructs to allow is answered by your own statement that you want to accept as many as you can. In that case you would remove all whitespace after a <, and if a ? follows, remove whitespace again up to the next word, then parse the result as XML.
Personally, I don't think accepting most constructs is desirable. If your data contains errors, they should be handled as such.
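If you do go the lenient route, the whitespace-removal step could look roughly like this. It's a minimal sketch, not a complete preprocessor: normalizePiOpeners is a name I made up, and the replacement runs blindly over the whole input, so it would also rewrite a matching sequence inside comments, CDATA sections, or attribute values.

```javascript
// Hypothetical preprocessor step: collapse "<", optional whitespace, "?",
// optional whitespace into a plain "<?" so the target name follows directly.
// Assumption: applied to the raw text before any real parsing happens.
function normalizePiOpeners(input) {
  return input.replace(/<\s*\?\s*/g, '<?');
}

// All four variants from the question end up in the strict form.
const samples = ['<?asdf ?>', '< ?asdf ?>', '<? asdf ?>', '< ? asdf ?>'];
for (const s of samples) {
  console.log(normalizePiOpeners(s)); // each prints: <?asdf ?>
}
```

The output of this step would then be handed to the next preprocessor or straight to the XML parser.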