What is the best way to parse a tab-delimited file in Ruby?
What's the best (most efficient) way to parse a tab-delimited file in Ruby?
The Ruby CSV library lets you specify the field delimiter. As of Ruby 1.9, the standard CSV library is FasterCSV. Setting the col_sep option to a tab should work.
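A minimal sketch of that approach, assuming a hypothetical tab-separated string with a header row (the sample data and variable names are illustrative, not from the original answer):

```ruby
require "csv"

# Hypothetical sample data: a header row followed by two records.
tsv_text = "name\tcity\nAda\tLondon\nAlan\tWilmslow\n"

# col_sep switches the field delimiter from a comma to a tab.
# headers: true treats the first row as column names.
rows = CSV.parse(tsv_text, col_sep: "\t", headers: true).map(&:to_h)

rows.each { |row| puts "#{row['name']} lives in #{row['city']}" }
```

For a file on disk, `CSV.foreach(path, col_sep: "\t")` accepts the same option and reads one row at a time instead of loading everything into memory.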
The rules for TSV are actually a bit different from CSV. The main difference is that CSV has provisions for sticking a comma inside a field and then using quotation characters and escaping quotes inside a field. I wrote a quick example to show how the simple response fails:
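A minimal sketch of such a failure, assuming a hypothetical line whose last field contains literal double quotes (the helper `parse_ok?` is made up for this illustration):

```ruby
require "csv"

# Hypothetical TSV line: the third field contains bare double quotes.
line = "boogie\ttime\tis \"now\""

# Returns true if CSV accepts the line with the given options, false otherwise.
def parse_ok?(line, **options)
  CSV.parse_line(line, **options)
  true
rescue CSV::MalformedCSVError
  false
end

# With default quoting rules, CSV treats the stray quotes as illegal quoting.
plain_ok = parse_ok?(line, col_sep: "\t")

# A plain split has no quoting rules, so the same line splits cleanly.
split_fields = line.split("\t", -1)

puts "CSV with col_sep only parsed the line: #{plain_ok}"
puts "split produced #{split_fields.size} fields"
```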
If you want to use the CSV library you could use a random quote character that you don't expect to see in your file (the example shows this), but you could also use a simpler methodology like the StrictTsv class shown below to get the same effect without having to worry about field quotations.
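Both options can be sketched as follows. The class name StrictTsv comes from the answer, but its body here is an assumption, as are the sample data and the choice of `"\x00"` as the quote character that is presumed never to occur in the file:

```ruby
require "csv"
require "stringio"

# Sketch of a strict TSV reader: no quoting rules, every tab is a separator.
class StrictTsv
  def initialize(io)
    @io = io
  end

  # Yields each data row as a Hash keyed by the header row.
  def parse
    return enum_for(:parse) unless block_given?

    header_line = @io.gets
    return if header_line.nil? # empty input: nothing to yield

    headers = header_line.chomp.split("\t", -1)
    @io.each_line do |line|
      # chomp drops only the line ending; -1 keeps empty trailing fields.
      yield headers.zip(line.chomp.split("\t", -1)).to_h
    end
  end
end

# Hypothetical data with literal double quotes inside a field.
data = "title\tnote\nboogie\tis \"now\"\n"

strict_rows = StrictTsv.new(StringIO.new(data)).parse.to_a

# Alternative: keep CSV, but pick a quote character assumed to be absent.
csv_rows = CSV.parse(data, col_sep: "\t", quote_char: "\x00", headers: true).map(&:to_h)
```

Passing an IO object rather than a file path is a design choice made here so the sketch works with `File.open` blocks and in-memory strings alike.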
The choice of using the CSV library or something more strict just depends on who is sending you the file and whether they are expecting to adhere to the strict TSV standard.
Details about the TSV standard can be found at http://en.wikipedia.org/wiki/Tab-separated_values
There are actually two different kinds of TSV files.
1. TSV files that are actually CSV files with the delimiter set to Tab. This is what you get when you, for example, save an Excel spreadsheet as "UTF-16 Unicode Text". Such files use CSV quoting rules, which means that fields may contain tabs and newlines as long as they are quoted, and literal double quotes are written twice. The easiest way to parse everything correctly is to use the csv gem with the column separator set to a tab.
2. TSV files conforming to the IANA standard. Tabs and newlines are not allowed as field values, and there is no quoting whatsoever. This is what you get when you, for example, select a whole Excel spreadsheet and paste it into a text file (beware: it will get messed up if some cells do contain tabs or newlines). Such TSV files can be parsed line by line with a simple line.chomp.split("\t", -1). Note the -1, which prevents split from removing empty trailing fields, and prefer chomp over rstrip, because rstrip would also strip trailing tabs and so drop those empty fields. If you want to use the csv gem, simply set quote_char to nil.
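The two kinds can be sketched side by side. The sample strings are hypothetical, and the last call assumes a csv version recent enough to accept `quote_char: nil`:

```ruby
require "csv"

# Kind 1: CSV quoting rules with a tab separator.
# The quoted field holds a tab, a newline and a doubled double quote.
quoted_tsv = "id\tnote\n1\t\"a\tb\nsays \"\"hi\"\"\"\n"
quoted_rows = CSV.parse(quoted_tsv, col_sep: "\t")

# Kind 2: IANA-style TSV with no quoting and two empty trailing fields.
plain_line = "1\tsays \"hi\"\t\t\n"

# chomp removes only the line ending; -1 keeps the empty trailing fields.
plain_fields = plain_line.chomp.split("\t", -1)

# The same line through the csv gem with quoting disabled.
csv_fields = CSV.parse_line(plain_line, col_sep: "\t", quote_char: nil)

puts quoted_rows.inspect
puts plain_fields.inspect
puts csv_fields.inspect
```

One difference worth checking in your own data: `split` returns empty strings for empty fields, while the csv gem may return nil for them.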
I like mmmries's answer. However, I dislike the way Ruby's split drops any empty values from the end of the result. That parse method doesn't strip the newline at the end of each line, either.
Also, I had a file with potential newlines within a field. So, I rewrote his 'parse' as follows:
This concatenates any lines as necessary to get a full line of data, and always returns the full set of data (without potential nil entries at the end).
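A sketch of that kind of rewrite, not the author's original code. It assumes every complete row has exactly as many fields as the header, so a record with too few tabs must continue on the next physical line; the method name `parse_tsv` and the sample data are hypothetical:

```ruby
require "stringio"

# Reads TSV rows whose fields may contain raw newlines.
# Returns an array of Hashes keyed by the header row.
def parse_tsv(io)
  headers = io.gets.chomp.split("\t", -1)
  rows = []
  io.each_line do |line|
    record = line
    # Keep appending physical lines until the record has enough tabs.
    while record.count("\t") < headers.size - 1
      continuation = io.gets
      break if continuation.nil? # truncated input: keep what we have
      record += continuation
    end
    # A positive limit keeps empty trailing fields, so no nil values appear.
    rows << headers.zip(record.chomp.split("\t", headers.size)).to_h
  end
  rows
end

# Hypothetical data: the first record spans two lines and ends with an empty field.
data = "id\tnote\tflag\n1\tline one\nline two\t\n2\tok\ty\n"
rows = parse_tsv(StringIO.new(data))
rows.each { |row| puts row.inspect }
```

Counting tabs cannot detect a newline inside the last field of a row, since such a record already has the full number of tabs when the line break occurs.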