Fault-tolerant Python-based parser for WikiLeaks cables
Some time ago I started writing a BNF-based grammar for the cables which WikiLeaks released. However I now realized that my approach is maybe not the best and I'm looking for some improvement.
A cable consists of three parts. The head has some RFC2822-style format; this part usually parses correctly. The text part has a more informal specification. For instance, there is a REF line. It should start with "REF:", but I found different versions. The following regex catches most cases: ^\s*[Rr][Ee][Ff][Ss: ]. So there are spaces in front, different cases, and so on. The text part is mostly plain text with some specially formatted headings.
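To illustrate, the pattern above can be checked with Python's re module (the sample lines here are made up, not taken from real cables):

    import re

    # The pattern quoted above: optional leading whitespace, "ref" in any
    # case, then one character that is "S", "s", ":" or a space.
    REF_RE = re.compile(r'^\s*[Rr][Ee][Ff][Ss: ]')

    for line in ('REF: STATE 123', '  ref state 456', 'REFS 05PARIS99'):
        print(bool(REF_RE.match(line)), repr(line))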
We want to recognize each field (date, REF, etc.) and put it into a database. We chose Python's SimpleParse. At the moment the parser stops at each field it doesn't recognize. We are now looking for a more fault-tolerant solution. All fields appear in some kind of order: when the parser doesn't recognize a field, it should add some 'not recognized' blob to the current field and go on. (Or maybe you have a better approach here; a rough sketch of the behaviour we have in mind is below.)
What kind of parser, or other kind of solution, would you suggest? Is there something better around?
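For concreteness, here is a minimal sketch of the fall-through behaviour we mean. It is plain Python with made-up field names and patterns, not our actual SimpleParse grammar:

    import re

    # Field patterns are illustrative assumptions, not the real cable grammar.
    FIELD_PATTERNS = [
        ('date',    re.compile(r'^\s*DATE:', re.I)),
        ('ref',     re.compile(r'^\s*REF[Ss: ]', re.I)),
        ('subject', re.compile(r'^\s*SUBJ(?:ECT)?:', re.I)),
    ]

    def tolerant_parse(lines):
        """Try each known field in order; unmatched lines become a
        'not recognized' blob attached to the current field."""
        fields = {}
        current = None
        for line in lines:
            for name, pattern in FIELD_PATTERNS:
                if pattern.match(line):
                    current = name
                    fields.setdefault(name, []).append(line.rstrip())
                    break
            else:
                # Nothing matched: record the text and keep going
                # instead of aborting the whole parse.
                key = (current or 'head') + '/not-recognized'
                fields.setdefault(key, []).append(line.rstrip())
        return fields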
2 Answers
Cablemap seems to do what you're searching for: http://pypi.python.org/pypi/cablemap.core/
I haven't looked at the cables, but let's take a similar problem and consider the options: say you wanted to write a parser for RFCs. There's an RFC for the formatting of RFCs, but not all RFCs follow it.
If you wrote a strict parser, you'd run into exactly the situation you describe: the outliers halt your progress. In that case you've got two options:
Split them into two groups: the ones that are strictly formatted and the ones that aren't. Write your strict parser so that it gets the low-hanging fruit, and figure out, based on the number of outliers, what the best option is (hand processing, an outlier parser, etc.).
If the two groups are equally sized, or there are more outliers than standard formats, write a flexible parser. In this case regular expressions are going to be more beneficial to you, since you can process an entire file looking for a series of flexible regexes; if one of the regexes fails, you can easily generate the outlier list. And since you can run the search against a series of regexes, you can build a pass/fail matrix for each regex (see the sketch after this list).
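A rough sketch of that pass/fail matrix; the regex names and patterns here are placeholders, not derived from the actual cables:

    import re

    # Placeholder field regexes; substitute the real ones.
    REGEXES = {
        'ref':  re.compile(r'^\s*REF[Ss: ]', re.I | re.M),
        'date': re.compile(r'^\s*DATE:',     re.I | re.M),
    }

    def pass_fail_matrix(documents):
        """documents: {doc_id: full_text}. Returns {doc_id: {regex_name: bool}}."""
        return {
            doc_id: {name: bool(rx.search(text)) for name, rx in REGEXES.items()}
            for doc_id, text in documents.items()
        }

    # Any document with a failing regex is an outlier:
    # outliers = [d for d, row in matrix.items() if not all(row.values())]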
For 'fuzzy' data where some records follow the format and some don't, I much prefer the regex approach. That's just me, though. (Yes, it is slower, but having to engineer the relationship between each match segment so that you have a single query (or parser) that fits every corner case is a nightmare when dealing with human-generated input.)