How can I further process the lines of data that cause Ruby's FasterCSV library to throw a MalformedCSVError?
The incoming data file(s) contain malformed CSV data such as non-escaped quotes, as well as (valid) CSV data such as fields containing new lines. If a CSV format error is detected I would like to use an alternative routine on that data.
With the following sample code (abbreviated for simplicity):

FasterCSV.open( file ){|csv|
  row = true
  while row
    begin
      row = csv.shift
      break unless row
      # Do things with the good rows here...
    rescue FasterCSV::MalformedCSVError => e
      # Do things with the bad rows here...
      next
    end
  end
}
The MalformedCSVError is caused in the csv.shift method. How can I access the data that caused the error from the rescue clause?
Just feed the file line by line to FasterCSV and rescue the error.
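That advice might look like the following minimal sketch. It uses the standard CSV library, which is FasterCSV as of Ruby 1.9; on the old gem, substitute FasterCSV.parse_line and FasterCSV::MalformedCSVError. The sample data is mine, for illustration only.

```ruby
require 'csv' # Ruby 1.9+'s CSV *is* FasterCSV

# Hypothetical input: the second line has an unescaped quote.
data = "a,b,c\n\"bad\"row,x,y\n1,2,3\n"

good_rows = []
bad_lines = []
data.each_line do |line|
  begin
    good_rows << CSV.parse_line(line)
  rescue CSV::MalformedCSVError
    bad_lines << line # the offending raw line is still in hand here
  end
end

good_rows # => [["a", "b", "c"], ["1", "2", "3"]]
bad_lines # => ["\"bad\"row,x,y\n"]
```

Note the trade-off: splitting on physical lines defeats CSV's support for quoted fields that contain embedded newlines, which the question says the valid data includes.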
This is going to be really difficult. Some things that make FasterCSV, well, faster, make this particularly hard. Here's my best suggestion: FasterCSV can wrap an IO object. What you could do, then, is to make your own subclass of File (itself a subclass of IO) that "holds onto" the result of the last gets. Then when FasterCSV raises an exception you can ask your special File object for the last line. Kinda neat, right? BUT! There are caveats, of course:

1. I'm not sure how much of a performance hit you take when you add an extra step to every gets call. It might be an issue if you need to parse multi-million-line files in a timely fashion.

2. This might or might not fail if your CSV file contains newline characters inside quoted fields. The reason for this is described in the source: basically, if a quoted value contains a newline then shift has to do additional gets calls to get the entire line. There could be a clever way around this limitation, but it's not coming to me right now. If you're sure your file doesn't have any newline characters within quoted fields, then this shouldn't be a worry for you.

Your other option would be to read the file using File.gets and pass each line in turn to FasterCSV#parse_line, but I'm pretty sure in so doing you'd squander any performance advantage gained from using FasterCSV.
I used Jordan's file subclassing approach to fix the problem with my input data before CSV ever tries to parse it. In my case, I had a file that used \" to escape quotes instead of the "" that CSV expects. This allowed me to parse the non-standard CSV file. Ruby's CSV implementation is very strict and often has trouble with the many variants of the CSV format.
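This answer's code was also lost in this copy, but the fix-up it describes could be sketched like this (class name and sample file are mine; note that rewriting chunk-by-chunk can miss an escape sequence split across a gets boundary, so treat this as a sketch rather than a robust filter):

```ruby
require 'csv'

# Illustrative sketch: rewrite backslash-escaped quotes (\") into the
# doubled quotes ("") that Ruby's CSV expects, as each line is read.
class QuoteFixingFile < File
  def gets(*args)
    line = super
    line && line.gsub('\\"', '""')
  end
end

# Hypothetical legacy file that escapes embedded quotes with \":
File.write('legacy.csv', 'Bob,"he said \\"hi\\""' + "\n")

f = QuoteFixingFile.open('legacy.csv')
line = f.gets         # => 'Bob,"he said ""hi"""' + "\n"
CSV.parse_line(line)  # => ["Bob", 'he said "hi"']
f.close
```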