How to define a fixed-width constraint in Parslet

Posted on 2024-10-29 13:25:10

I am looking into parslet to write a lot of data import code. Overall, the library looks good, but I'm struggling with one thing. A lot of our input files are fixed width, and the widths differ between formats even when the actual field doesn't. For example, we might get one file with a 9-character currency field and another with an 11-character one (or whatever). Does anyone know how to define a fixed-width constraint on a parslet atom?

Ideally, I would like to define an atom that understands currency (with optional dollar signs, thousand separators, etc.) and then, on the fly, create a new atom based on the old one that is exactly equivalent, except that it parses exactly N characters.

Does such a combinator exist in parslet? If not, would it be possible (or difficult) to write one myself?

Comments (3)

那片花海 2024-11-05 13:25:10

What about something like this...

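class MyParser < Parslet::Parser
    def initialize(widths)
        @widths = widths
        super()  # explicit parentheses so widths is not forwarded to the superclass
    end

    rule(:currency)  {...}
    rule(:fixed_c)   {currency.fixed(@widths[:currency])}

    rule(:fixed_str) {str("bob").fixed(4)}
end

# The constructor requires the widths hash; 9 is only an example value.
puts MyParser.new(:currency => 9).fixed_str.parse("bob").inspect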

This will fail with:

"Expected 'bob' to be 4 long at line 1 char 1"

Here's how you do it:

require 'parslet'

# Wraps another atom and succeeds only if that atom matched exactly len characters.
class Parslet::Atoms::FixedLength < Parslet::Atoms::Base
  attr_reader :len, :parslet
  def initialize(parslet, len, tag=:length)
    super()

    raise ArgumentError,
      "Asking for zero length of a parslet. (#{parslet.inspect} length #{len})" \
      if len == 0

    @parslet = parslet
    @len = len
    @tag = tag
    @error_msgs = {
      :lenrep  => "Expected #{parslet.inspect} to be #{len} long",
      :unconsumed => "Extra input after last repetition"
    }
  end

  def try(source, context, consume_all)
    start_pos = source.pos

    # Run the wrapped atom without requiring it to consume all remaining input.
    success, value = parslet.apply(source, context, false)

    # Note: value.str assumes the wrapped atom yields a plain slice,
    # not a hash produced by .as(...) captures.
    return succ(value) if success && value.str.length == @len

    # Either the wrapped atom failed or it matched the wrong number of characters.
    context.err_at(
      self,
      source,
      @error_msgs[:lenrep],
      start_pos,
      [value])
  end

  precedence REPETITION
  def to_s_inner(prec)
    parslet.to_s(prec) + "{len:#{@len}}"
  end
end

# Make .fixed(len) available on every atom.
module Parslet::Atoms::DSL
  def fixed(len)
    Parslet::Atoms::FixedLength.new(self, len)
  end
end

孤独岁月 2024-11-05 13:25:10

Methods in parser classes are basically generators for parslet atoms. The simplest form these methods take is a rule: a method that returns the same atom every time it is called. It is just as easy to create your own generators that are less simple. See http://kschiess.github.com/parslet/tricks.html for an illustration of this trick (matching strings case-insensitively).

It seems to me that your currency parser has only a few parameters, so you could probably write a method (def ... end) that returns currency parsers tailored to your needs. Maybe even use initialize and constructor arguments (i.e. MoneyParser.new(4,5))?

For more help, please address your questions to the mailing list. Such questions are often easier to answer if you illustrate them with code.

少年亿悲伤 2024-11-05 13:25:10

Maybe my partial solution will help to clarify what I meant in the question.

Let's say you have a somewhat non-trivial parser:

class MyParser < Parslet::Parser
    rule(:dollars) {
        match('[0-9]').repeat(1).as(:dollars)
    }
    rule(:comma_separated_dollars) {
        match('[0-9]').repeat(1, 3).as(:dollars) >> ( match(',') >> match('[0-9]').repeat(3, 3).as(:dollars) ).repeat(1)
    }
    rule(:cents) {
        match('[0-9]').repeat(2, 2).as(:cents)
    }
    rule(:currency) {
        # order is important in (comma_separated_dollars | dollars)
        (str('$') >> (comma_separated_dollars | dollars) >> str('.') >> cents).as(:currency)
    }
end

Now if we want to parse a fixed-width currency string, this isn't the easiest thing to do. Of course, you could figure out exactly how to express the repeat expressions in terms of the final width, but that gets unnecessarily tricky, especially in the comma-separated case. Also, in my use case, currency is really just one example. I want an easy way to come up with fixed-width definitions for addresses, zip codes, etc.

This seems like something a PEG should be able to handle. I managed to write a prototype version, using Lookahead as a template:

# Note: this targets the Parslet API of the time (try(source, context),
# error, error_tree); later versions use different signatures.
class FixedWidth < Parslet::Atoms::Base
    attr_reader :bound_parslet
    attr_reader :width

    def initialize(width, bound_parslet) # :nodoc:
        super()

        @width = width
        @bound_parslet = bound_parslet
        @error_msgs = {
            :premature => "Premature end of input (expected #{width} characters)",
            :failed => "Failed fixed width",
        }
    end

    def try(source, context) # :nodoc:
        pos = source.pos
        # Read exactly width characters; fewer means the input ended early.
        teststring = source.read(width).to_s
        if teststring.size != width
            return error(source, @error_msgs[:premature])
        end
        # Run the bound parslet against only the fixed-width window.
        fakesource = Parslet::Source.new(teststring)
        value = bound_parslet.apply(fakesource, context)
        return value if not value.error?

        source.pos = pos
        return error(source, @error_msgs[:failed])
    end

    def to_s_inner(prec) # :nodoc:
        "FIXED-WIDTH(#{width}, #{bound_parslet.to_s(prec)})"
    end

    def error_tree # :nodoc:
        Parslet::ErrorTree.new(self, bound_parslet.error_tree)
    end
end

# now we can easily define a fixed-width currency rule
# (assumes SHPParser inherits the currency rule from MyParser above):
class SHPParser < MyParser
    rule(:currency15) {
        FixedWidth.new(15, currency >> str(' ').repeat)
    }
end

Of course, this is a pretty hacked solution. Among other things, line numbers and error messages are not good inside a fixed-width constraint. I would love to see this idea implemented in a better fashion.
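
The core of the prototype (take exactly width characters, then run the inner parser on only that window) can also be sketched without touching Parslet's internals. The following is a plain-Ruby illustration, not Parslet API: FixedWidthField, parse_at, and the regular-expression stand-in for the currency grammar are all hypothetical. Any object that responds to call can serve as the inner parser, so with Parslet one could presumably pass the parse method of a parser or atom instead.

```ruby
# Hypothetical helper: hands exactly `width` characters of a line to an
# inner parser (any object that responds to #call).
class FixedWidthField
  def initialize(width, inner)
    @width = width
    @inner = inner
  end

  # Parses the window starting at `offset`; raises ArgumentError when the
  # line is too short. Errors from the inner parser propagate unchanged.
  def parse_at(line, offset)
    window = line[offset, @width]
    if window.nil? || window.length != @width
      raise ArgumentError, "Premature end of input (expected #{@width} characters)"
    end
    @inner.call(window)
  end
end

# Stand-in for the currency grammar: optional "$", plain or comma-separated
# dollars, two-digit cents, then space padding up to the end of the window.
CURRENCY = /\A\$?(\d{1,3}(?:,\d{3})+|\d+)\.(\d{2}) *\z/

parse_currency = lambda do |text|
  m = CURRENCY.match(text)
  raise ArgumentError, "Failed fixed width: #{text.inspect}" unless m
  { dollars: m[1].delete(','), cents: m[2] }
end

# The same inner parser is reused for two different field widths.
currency9  = FixedWidthField.new(9, parse_currency)
currency11 = FixedWidthField.new(11, parse_currency)

p currency9.parse_at('$1,234.56ABC', 0)
p currency11.parse_at('ID$1,234.56  ', 2)
```

Because the inner parser only ever sees the window, it cannot run past the field boundary, and because it must match the whole window, it cannot stop short of it. The trade-off is the one noted above: positions reported by the inner parser are relative to the window, not to the original line.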
