How can I further process the rows that cause Ruby's FasterCSV library to raise MalformedCSVError?

Posted on 2024-12-08 13:39:50

The incoming data file(s) contain malformed CSV data such as non-escaped quotes, as well as (valid) CSV data such as fields containing new lines. If a CSV format error is detected I would like to use an alternative routine on that data.

With the following sample code (abbreviated for simplicity)

FasterCSV.open( file ){|csv|
  row = true
  while row
    begin
      row = csv.shift
      break unless row
      # Do things with the good rows here...

    rescue FasterCSV::MalformedCSVError => e
      # Do things with the bad rows here...
      next
    end
  end
}

The MalformedCSVError is raised inside the csv.shift call. How can I access the data that caused the error from the rescue clause?

冷心人i 2024-12-15 13:39:50

require 'csv' #CSV in ruby 1.9.2 is identical to FasterCSV

# File.open('test.txt','r').each do |line|
DATA.each do |line|
  begin
    CSV.parse(line) do |row|
      p row #handle row
    end
  rescue  CSV::MalformedCSVError => er
    puts er.message
    puts "This one: #{line}"
    # and continue
  end
end

# Output:

# Unclosed quoted field on line 1.
# This one: 1,"aaa
# Illegal quoting on line 1.
# This one: aaa",valid
# Unclosed quoted field on line 1.
# This one: 2,"bbb
# ["bbb", "invalid"]
# ["3", "ccc", "valid"]   

__END__
1,"aaa
aaa",valid
2,"bbb
bbb,invalid
3,ccc,valid

Just feed the file line by line to FasterCSV and rescue the error.

北城挽邺 2024-12-15 13:39:50

This is going to be really difficult. Some things that make FasterCSV, well, faster, make this particularly hard. Here's my best suggestion: FasterCSV can wrap an IO object. What you could do, then, is to make your own subclass of File (itself a subclass of IO) that "holds onto" the result of the last gets. Then when FasterCSV raises an exception you can ask your special File object for the last line. Something like this:

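class MyFile < File
  attr_writer :last_gets

  # Lazily create the buffer per instance; an assignment in the class body
  # would set a variable on the class itself, leaving instances with nil.
  def last_gets
    @last_gets ||= ''
  end

  def gets(*args)
    line = super
    last_gets << line if line # gets returns nil at end of file
    line
  end
end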

# then...

file  = MyFile.open(filename, 'r')
csv   = FasterCSV.new file

row = true
while row
  begin
    break unless row = csv.shift

    # do things with the good row here...

  rescue FasterCSV::MalformedCSVError => e
    bad_row = file.last_gets

    # do something with bad_row here...

    next
  ensure
    file.last_gets = '' # nuke the @last_gets "buffer"
  end
end

Kinda neat, right? BUT! there are caveats, of course:

I'm not sure how much of a performance hit you take when you add an extra step to every gets call. It might be an issue if you need to parse multi-million-line files in a timely fashion.
This might or might not fail if your CSV file contains newline characters inside quoted fields. The reason is described in the source: if a quoted value contains a newline, then shift has to make additional gets calls to get the entire record. There could be a clever way around this limitation but it's not coming to me right now. If you're sure your file doesn't have any newline characters within quoted fields then this shouldn't be a worry for you, though.
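To make the newline caveat concrete, here is a minimal sketch using the standard library's CSV class (assumed here as the stand-in for FasterCSV on Ruby 1.9 and later; the sample record is made up). It shows that a quoted field spanning two physical lines parses as one row when the parser sees the whole record, while the first physical line on its own is rejected:

```ruby
require 'csv'

# Hypothetical sample: one valid record whose quoted field contains a newline.
record = %Q{1,"aaa\naaa",valid\n}

# Parsing the whole text yields a single row with the newline preserved.
rows = CSV.parse(record)

# Parsing only the first physical line leaves the quote unclosed.
first_line = record.lines.first
first_line_error =
  begin
    CSV.parse_line(first_line)
    nil
  rescue CSV::MalformedCSVError => e
    e.class
  end

p rows
p first_line_error
```

This is why a buffer that only remembers a single gets result may hold just part of the offending record.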

Your other option would be to read the file using File#gets and pass each line in turn to FasterCSV.parse_line, but I'm pretty sure in so doing you'd squander any performance advantage gained from using FasterCSV.
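A rough sketch of that line-by-line option follows. It assumes the standard library's CSV class in place of FasterCSV, and the helper name split_good_and_bad and the sample data are invented for illustration. Because each physical line is treated as a complete record, valid records with newlines inside quoted fields would end up in the bad list too:

```ruby
require 'csv'
require 'stringio'

# Hypothetical helper: parses each physical line on its own and separates
# parsed rows from raw lines that raise CSV::MalformedCSVError.
def split_good_and_bad(io)
  good = []
  bad = []
  io.each_line do |line|
    begin
      good << CSV.parse_line(line)
    rescue CSV::MalformedCSVError
      bad << line.chomp # keep the raw text for the alternative routine
    end
  end
  [good, bad]
end

# Made-up input: the second line has an unclosed quote.
sample = StringIO.new(%Q{1,aaa,valid\n2,"bbb\n3,ccc,valid\n})
good, bad = split_good_and_bad(sample)
p good
p bad
```

Any IO-like object that responds to each_line should work in place of the StringIO, for example a File opened for reading.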


み格子的夏天 2024-12-15 13:39:50

I used Jordan's file subclassing approach to fix the problem with my input data before CSV ever tries to parse it. In my case, I had a file that used \" to escape quotes, instead of the "" that CSV expects. Hence,

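require 'csv'

class MyFile < File
  # Rewrite backslash-escaped quotes (\") as doubled quotes ("") before CSV sees the line.
  def gets(*args)
    line = super
    line.gsub!('\\"', '""') unless line.nil? # gets returns nil at end of file
    line
  end
end

infile = MyFile.open(filename)
incsv = CSV.new(infile)

# Read rows from the CSV wrapper, not from the file object (File has no #shift).
while row = incsv.shift
  # process each row here
end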

This allowed me to parse the non-standard CSV file. Ruby's CSV implementation is very strict and often has trouble with the many variants of the CSV format.
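The substitution itself can be tried on a single string without subclassing File. The sketch below uses a made-up sample line and only shows the effect of the same gsub call before the line is handed to CSV.parse_line:

```ruby
require 'csv'

# Hypothetical sample line that escapes quotes with a backslash.
raw = '1,"say \"hi\"",valid'

# Same substitution as in MyFile#gets: \" becomes "".
fixed = raw.gsub('\\"', '""')

row = CSV.parse_line(fixed)
p fixed
p row
```

Note that a blanket substitution like this would also rewrite a literal backslash that happens to sit right before a closing quote, so it only suits files where \" is always meant as an escaped quote.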
