What's the best way to parse a tab-delimited file in Ruby?

Posted on 2024-10-07 02:29:04

What's the best (most efficient) way to parse a tab-delimited file in Ruby?

Comments (4)

薔薇婲 2024-10-14 02:29:04

The Ruby CSV library lets you specify the field delimiter. Since Ruby 1.9, the standard csv library is based on FasterCSV. Something like this would work:

require "csv"
parsed_file = CSV.read("path-to-file.csv", col_sep: "\t")
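
Since the question asks about efficiency: CSV.read loads the whole file into memory, so for large files it may be worth streaming rows with CSV.foreach instead. A minimal sketch, assuming the file has a header row; the helper name each_tsv_row, the sample data, and the column names are made up for illustration:

```ruby
require "csv"
require "tempfile"

# Hypothetical helper: yields each data row of a tab-delimited file as a Hash,
# reading one row at a time instead of loading the whole file.
def each_tsv_row(path)
  CSV.foreach(path, col_sep: "\t", headers: true) do |row|
    yield row.to_h
  end
end

# Made-up sample data, written to a temp file so the sketch is self-contained.
Tempfile.create(["sample", ".tsv"]) do |file|
  file.write("name\tcity\nAda\tLondon\nLinus\tHelsinki\n")
  file.flush
  each_tsv_row(file.path) { |row| puts "#{row["name"]} lives in #{row["city"]}" }
end
```

Note that this still applies CSV quoting rules to every field, which matters if the data contains literal double quotes.
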
不如归去 2024-10-14 02:29:04

The rules for TSV are actually a bit different from CSV. The main difference is that CSV lets a field contain the delimiter by wrapping the field in quote characters, with any quotes inside the field escaped, while strict TSV has no quoting at all. I wrote a quick example to show how the simple answer fails:

require 'csv'
# Double-quoted string, so that \t is a real tab character.
line = "boogie\ttime\tis \"now\""
begin
  fields = CSV.parse_line(line, col_sep: "\t")
  puts "parsed correctly"
rescue CSV::MalformedCSVError
  puts "failed to parse line"
end

begin
  # Use a quote character that should never appear in the data.
  fields = CSV.parse_line(line, col_sep: "\t", quote_char: "Ƃ")
  puts "parsed correctly with random quote char"
rescue CSV::MalformedCSVError
  puts "failed to parse line with random quote char"
end

# Output:
# failed to parse line
# parsed correctly with random quote char

If you want to use the CSV library, you could use a random quote character that you don't expect to see in your file (the example shows this), but you could also use a simpler approach like the StrictTsv class shown below to get the same effect without having to worry about field quoting.

# The main parse method is mostly borrowed from a tweet by @JEG2
class StrictTsv
  attr_reader :filepath

  def initialize(filepath)
    @filepath = filepath
  end

  # Yields each data row as a Hash keyed by the header names.
  # Note: the last field of each row keeps its trailing newline.
  def parse
    File.open(filepath) do |f|
      headers = f.gets.strip.split("\t")
      f.each do |line|
        fields = Hash[headers.zip(line.split("\t"))]
        yield fields
      end
    end
  end
end

# Example Usage
tsv = StrictTsv.new("your_file.tsv")
tsv.parse do |row|
  puts row['named field']
end

The choice of using the CSV library or something more strict just depends on who is sending you the file and whether they are expecting to adhere to the strict TSV standard.

Details about the TSV standard can be found at http://en.wikipedia.org/wiki/Tab-separated_values

梅倚清风 2024-10-14 02:29:04

There are actually two different kinds of TSV files.

  1. TSV files that are actually CSV files with a delimiter set to Tab. This is something you'll get when you e.g. save an Excel spreadsheet as "UTF-16 Unicode Text". Such files use CSV quoting rules, which means that fields may contain tabs and newlines, as long as they are quoted, and literal double quotes are written twice. The easiest way to parse everything correctly is to use the csv gem:

    require 'csv'
    parsed = CSV.read("file.tsv", col_sep: "\t")
    
  2. TSV files conforming to the IANA standard. Tabs and newlines are not allowed as field values, and there is no quoting whatsoever. This is something you will get when you e.g. select a whole Excel spreadsheet and paste it into a text file (beware: it will get messed up if some cells do contain tabs or newlines). Such TSV files can be easily parsed line-by-line with a simple line.chomp.split("\t", -1). Note the -1, which prevents split from removing empty trailing fields, and use chomp rather than rstrip, because rstrip would also strip trailing tabs and so lose those empty fields. If you want to use the csv gem, simply set quote_char to nil:

    require 'csv'
    parsed = CSV.read("file.tsv", col_sep: "\t", quote_char: nil)
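
To make the line-by-line approach for the second kind of file concrete, here is a small sketch. The helper name split_tsv_line and the sample line are made up for illustration; it uses chomp so that only the line ending is removed:

```ruby
# Hypothetical helper: splits one line of strict (quote-free) TSV into fields.
def split_tsv_line(line)
  # chomp removes only the line ending; -1 keeps empty trailing fields.
  line.chomp.split("\t", -1)
end

p split_tsv_line("a\tb\t\t\n")        # keeps the two empty trailing fields
p "a\tb\t\t\n".rstrip.split("\t", -1) # rstrip has already removed the trailing tabs
```

The second call shows why rstrip is risky here: tabs count as whitespace, so the empty trailing fields are gone before split ever sees them.
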
    
狠疯拽 2024-10-14 02:29:04

I like mmmries's answer. However, I hate the way that Ruby's split drops any empty values at the end of a line. That code doesn't strip the newline at the end of each line, either.

Also, I had a file with potential newlines within a field. So, I rewrote his 'parse' as follows:

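def parse
  File.open(filepath) do |f|
    headers = f.gets.chomp.split("\t")
    f.each do |line|
      myline = line
      # A newline inside a field splits the record across physical lines,
      # so keep appending lines until the record has enough tabs.
      while myline.count("\t") < headers.count - 1
        more = f.gets
        break if more.nil? # end of file: yield the incomplete record as-is
        myline += more
      end
      # The limit keeps empty trailing fields instead of dropping them.
      fields = Hash[headers.zip(myline.chomp.split("\t", headers.count))]
      yield fields
    end
  end
end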

This concatenates any lines as necessary to get a full line of data, and always returns the full set of data (without potential nil entries at the end).
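
For anyone who wants to try the line-joining idea without a file on disk, here is a standalone sketch that reads from any IO object. The method name each_tsv_record and the sample data are made up for illustration:

```ruby
require "stringio"

# Hypothetical standalone version of the line-joining idea, reading from any IO.
def each_tsv_record(io)
  headers = io.gets.chomp.split("\t")
  until io.eof?
    record = io.gets
    # Keep appending physical lines until the record has enough tabs.
    while record.count("\t") < headers.size - 1 && !io.eof?
      record += io.gets
    end
    # The limit keeps empty trailing fields instead of dropping them.
    yield headers.zip(record.chomp.split("\t", headers.size)).to_h
  end
end

# Made-up data: the "note" field of the first record contains a newline.
data = StringIO.new("id\tnote\tstatus\n1\tfirst line\nsecond line\tok\n2\tplain\t\n")
each_tsv_record(data) { |row| p row }
```

One limitation of counting tabs: a newline inside the last field of a record cannot be detected, because the record already has all of its tabs by that point.
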
