Java 中的 CSV 自动检测

发布于 2024-12-22 02:59:00 字数 1770 浏览 6 评论 0原文

如果 CSV 被重新定义为“字符分隔值”，即使用任何单个字符（但通常是任何非字母数字符号）的数据，那么自动检测文件实际上是 CSV 的可靠方法是什么作为分隔符而不仅仅是逗号？

本质上，通过这个（重新）定义，CSV = DSV（“分隔符分隔值”），例如在此维基百科文章，而“逗号分隔值”格式在 RFC 4180。

更具体地说，是否有一种方法可以统计推断数据具有某种“固定”长度，即“可能的 CSV”？仅计算分隔符的数量并不总是有效，因为存在 每条记录的字段数量不同的 CSV 文件（即，与 RFC 4180 的要求相反，记录的字段数量不同）同一文件中的字段）。

CSV 识别似乎是一个特别具有挑战性的问题，特别是如果检测不能基于文件扩展名（例如，当读取无论如何都没有此类信息的流时）。

正确的（“完整”）自动检测至少需要做出 4 个决定才能可靠地做出：

检测文件实际上是 CSV
检测标头是否存在
检测实际分隔符
检测特殊字符（例如引号）

完整由于其他数据集的相似性（例如，使用逗号的自由文本），自动检测似乎没有单一的解决方案，特别是对于像可变长度记录、单引号或双引号字段或多行记录。

因此，最好的方法似乎是伸缩检测，即在应用 CSV 检测规则之前检查也可归类为 CSV 的格式（例如 Apache CLF 等日志文件格式）。

即使像 Excel 这样的商业应用程序似乎也依赖文件扩展名 (.csv) 来决定 (1)，这显然不是自动检测，尽管如果应用程序被告知数据是 CSV，问题会大大简化。

以下是一些讨论 (2) 和 (3) 启发式的优秀相关文章：

(4)（引号类型）的检测可以基于处理文件中的几行并查找相应的值（例如，每行偶数个 ' 或 " 表示单个或双引号）。此类处理可以通过初始化现有的 CSV 解析器（例如，OpenCSV）来完成，该解析器将妥善处理 CSV行分离（例如，多行事件）。

但是（1）呢，即首先决定数据是 CSV？

数据挖掘可以帮助做出这个决定吗？？

原文

What would be a reliable way of autodetecting that a file is actually CSV, if CSV was redefined to mean "Character-Separated Values", i.e. data using any single character (but typically any non-alphanumeric symbol) as the delimiter and not only commas?

Essentially, with this (re)definition, CSV = DSV ("Delimiter-Separated Values"), discussed, for example, in this Wikipedia article, whereas the "Comma-Separated Values" format is defined in RFC 4180.

More specifically, is there a method for statistically deducting that the data is of somehow "fixed" length, meaning "possible CSV"? Just counting the number of delimiters does not always work, because there are CSV files with variable numbers of fields per record (i.e., records that, opposite to what RFC 4180 mandates, do not have the same number of fields across the same file).

CSV recognition seems to be a particularly challenging problem, especially if detection cannot based on the file extension (e.g., when reading a stream that does not have such information anyway).

Proper ("full") autodetection needs at least 4 decisions to be made reliably:

Detecting that a file is actually CSV
Detecting the presence of headers
Detecting the actual separator character
Detecting special characters (e.g., quotes)

Full autodetection seems to have no single solution, due to the similarities of other datasets (e.g., free text that uses commas), especially for corner cases like variable length records, single or double quoted fields, or multiline records.

So, the best approach seems to be telescopic detection, in which formats that can also be classified as CSV (e.g., log file formats like the Apache CLF) are examined before the application of the CSV detection rules.

Even commercial applications like Excel seem to rely on the file extension (.csv) in order to decide for (1), which is obviously no autodetection, although the problem is greatly simplified if the application is told that the data is CSV.

Here are some good relevant articles discussing heuristics for (2) and (3):

The detection of (4), the type of quotes, can be based on processing a few lines from the file and looking for corresponding values (e.g., an even number of ' or " per row would mean single or double quotes). Such processing can be done via initializing an existing CSV parser (e.g., OpenCSV) that will take proper care of CSV row separation (e.g., multiline events).

But what about (1), i.e., deciding that the data is CSV in the first place?

Could data mining help in this decision?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

墨洒年华 2024-12-29 02:59:00

如果您无法限制用作分隔符的内容，那么您可以使用暴力。

您可以遍历引号字符、列分隔符和记录分隔符的所有可能组合（对于 ASCII，为 256 * 255 * 254 = 16581120）。

id,text,date
1,"Bob says, ""hi
..."", with a sigh",1/1/2012

删除所有带引号的列，这可以通过正则表达式替换来完成。

//quick javascript example of the regex, you'd replace the quote char with whichever character your currently testing
var test='id,text,date\n1,"bob, ""hi\n..."", sigh",1/1/2011';
console.log(test.replace(/"(""|.|\n|\r)*?"/gm,""));

id,text,date
1,,1/1/2012

按记录分隔符拆分

["id,text,date", "1,,1/1/2012"]

按列分隔符拆分记录

[ ["id", "text", "date"], ["1", "", "1/1/2012"] ]

如果数字每条记录的列数匹配，您对 CSV 有一定的信心。

3 == 3

如果列数不匹配，请尝试行、列和引号字符的其他组合

编辑

< strong>实际解析后的数据您对分隔符有信心，并且检查列类型一致性可能是一个有用的额外步骤

第一个（标题？）行中的所有列都是字符串
吗 X 列总是解析为 null/空或有效的 (int 、浮点数、日期）

要处理的 CSV 数据（行、列）越多，您可以从此方法中提取的信心就越大。

我认为这个问题有点愚蠢/过于笼统，如果你有一个你肯定想要的未知数据流首先检查所有“容易实现的目标”。二进制格式通常具有相当独特的标头签名，然后还有 XML 和 JSON 来轻松检测文本格式。

If you can't constrain whats used as a delimiter then you can use brute-force.

You could iterate through all possible combinations of quote character, column delimiter, and record delimiter (256 * 255 * 254 = 16581120 for ASCII).

id,text,date
1,"Bob says, ""hi
..."", with a sigh",1/1/2012

Remove all quoted columns, this can be done with a RegEx replace.

//quick javascript example of the regex, you'd replace the quote char with whichever character your currently testing
var test='id,text,date\n1,"bob, ""hi\n..."", sigh",1/1/2011';
console.log(test.replace(/"(""|.|\n|\r)*?"/gm,""));

id,text,date
1,,1/1/2012

Split on record delimiter

["id,text,date", "1,,1/1/2012"]

Split records on column delimiter

[ ["id", "text", "date"], ["1", "", "1/1/2012"] ]

If the number of columns per record match you have some CSV confidence.

3 == 3

If the number of columns don't match try another combination of row, column and quote character

EDIT

Actually parsing the data after you have confidence on the delimiters and checking for column type uniformity might be a useful extra step

Are all the columns in the first (header?) row strings
Does column X always parse out to null/empty or a valid (int, float, date)

The more CSV data (rows, columns) there is to work with, the more confidence you can extract from this method.

I think this question is kind of silly / overly general, if you have a stream of unknown data you'd definitely want to check for all of the "low hanging fruit" first. Binary formats usually have fairly distinct header signatures, then there's XML and JSON for easily detectable text formats.

回复收藏 0 原文

白鸥掠海 2024-12-29 02:59:00

总会有看起来像 CSV 的非 CSV 文件，反之亦然。例如，frankc 在您引用的 Java 链接中发布了病态（但完全有效）的 CSV 文件：

Name
Jim
Tom
Bill

我认为，最好的方法是对文件是 CSV 的可能性进行某种启发式估计。我能想到的一些启发是：

每一行上都会出现一个候选分隔符（或者，如果您愿意，每一行都有一个标记）。
给定候选分隔符，大多数（但不一定是全部）行具有相同数量的字段。
第一行的存在（看起来可能是标题）增加了文件包含 CSV 数据的可能性。

人们也许可以想出其他启发法。然后的方法是开发基于这些的评分算法。下一步是对已知 CSV 和非 CSV 文件的集合进行评分。如果存在足够清晰的分离，那么评分可能被认为是有用的，并且分数应该告诉您如何设置检测阈值。

There are always going to be non-CSV files that look like CSV, and vice versa. For instance, there's the pathological (but perfectly valid) CSV file that frankc posted in the Java link you cited:

Name
Jim
Tom
Bill

The best one can do, I think, is some sort of heuristic estimate of the likelihood that a file is CSV. Some heuristics I can think of are:

There is a candidate separator character that appears on every line (or, if you like, every line has one token).
Given a candidate separator character, most (but not necessarily all) of the lines have the same number of fields.
The presence of a first line that looks like it might be a header increases the likelihood of the file containing CSV data.

One can probably think up other heuristics. The approach would then be to develop a scoring algorithm based on these. The next step would be to score a collection of known CSV and non-CSV files. If there is a clear-enough separation, then the scoring could be deemed useful and the scores should tell you how to set a detection threshold.

回复收藏 0 原文

~没有更多了~