How to define a fixed-width constraint in Parslet
I am looking into Parslet to write a lot of data import code. Overall, the library looks good, but I'm struggling with one thing. A lot of our input files are fixed width, and the widths differ between formats even when the underlying field is the same. For example, we might get one file with a 9-character currency field and another with an 11-character one (or whatever). Does anyone know how to define a fixed-width constraint on a Parslet atom?
Ideally, I would like to define an atom that understands currency (with optional dollar signs, thousands separators, etc.) and then, on the fly, create a new atom based on the old one that is exactly equivalent except that it parses exactly N characters.
Does such a combinator exist in Parslet? If not, would it be possible (or difficult) to write one myself?
Comments (3)
What about something like this...
This will fail with:
Here's how you do it:
Methods in parser classes are basically generators for Parslet atoms. The simplest form these methods come in is the `rule`: a method that returns the same atom every time it is called. It is just as easy to create your own generators that are not such simple beasts. Please look at http://kschiess.github.com/parslet/tricks.html for an illustration of this trick (matching strings case-insensitively).
It seems to me that your currency parser is a parser with only a few parameters, and that you could probably create a method (def ... end) that returns currency parsers tailored to your liking. Maybe even use initialize and constructor arguments? (i.e. MoneyParser.new(4,5))
For more help, please address your questions to the mailing list. Such questions are often easier to answer if you illustrate them with code.
Maybe my partial solution will help to clarify what I meant in the question.
Let's say you have a somewhat non-trivial parser:
Now suppose we want to parse a fixed-width currency string; this isn't the easiest thing to do. Of course, you could figure out exactly how to express the repeat expressions in terms of the final width, but that gets unnecessarily tricky, especially in the comma-separated case. Also, in my use case, currency is really just one example. I want an easy way to come up with fixed-width definitions for addresses, zip codes, etc.
This seems like something a PEG should be able to handle. I managed to write a prototype version, using Lookahead as a template:
Of course, this is a pretty hacked solution. Among other things, line numbers and error messages are not good inside of a fixed width constraint. I would love to see this idea implemented in a better fashion.