Alternatives to CSV?
I intend to build a RESTful service which will return a custom text format. Given my very large volumes of data, XML/JSON is too verbose. I'm looking for a row based text format.
CSV is an obvious candidate. I'm however wondering if there isn't something better out there. The only ones I've found through a bit of research are CTX and Fielded Text (http://www.fieldedtext.org/).
I'm looking for a format which offers the following:
- Plain text, easy to read
- Very easy for most software platforms to parse
- Column definitions can change without requiring changes in software clients
Fielded Text is looking pretty good and I could definitely build a specification myself, but I'm curious to know what others have done, given that this must be a very old problem. It's surprising that there isn't a better standard out there.
What suggestions do you have?
Comments (5)
I'm sure you've already considered this, but I'm a fan of tab-delimited files (\t between fields, a newline at the end of each row).
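A minimal sketch in Python's standard library (the field names and values are just illustrative): if the first row carries the column names, clients look fields up by name, which also covers the asker's requirement that column definitions can change without breaking clients.

```python
import csv
import io

# Build a small tab-delimited payload; the header row carries the
# column definition, so it travels with the data.
rows = [
    {"id": "1", "name": "Ada", "score": "92"},
    {"id": "2", "name": "Linus", "score": "87"},
]
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name", "score"], dialect="excel-tab")
writer.writeheader()
writer.writerows(rows)

# Parsing keys off the header row, so added or reordered columns
# don't break existing clients.
for record in csv.DictReader(io.StringIO(buf.getvalue()), dialect="excel-tab"):
    print(record["name"], record["score"])
```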
I would say that since CSV is the standard, and since everyone under the sun can parse it, use it.
If I were in your situation, I would take the bandwidth hit and use GZIP+XML, just because it's so darn easy to use.
And, on that note, you could always require that your users support GZIP and just send it as XML/JSON, since that should do a pretty good job of removing the redundancy across the wire.
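To illustrate, a rough sketch with Python's standard gzip and json modules (the payload is made up); gzip works well here precisely because the key names repeated on every record are highly redundant:

```python
import gzip
import json

# Made-up payload: 10,000 records whose key names repeat on every row,
# which is exactly the redundancy gzip removes well.
records = [{"id": i, "name": f"user{i}", "score": i % 100} for i in range(10_000)]
raw = json.dumps(records).encode("utf-8")
compressed = gzip.compress(raw)
print(f"{len(raw)} bytes -> {len(compressed)} bytes")

# The client side just reverses the two steps.
restored = json.loads(gzip.decompress(compressed))
assert restored == records
```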
You could try YAML; its overhead is relatively small compared to formats such as XML or JSON.
Examples here: http://www.yaml.org/
Surprisingly, the website's text itself is YAML.
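For example, parsing a small YAML document with the third-party PyYAML package (the sample data is made up):

```python
import yaml  # third-party: pip install pyyaml

doc = """\
- id: 1
  name: Ada
  score: 92
- id: 2
  name: Linus
  score: 87
"""

# safe_load parses plain data types without executing arbitrary tags.
records = yaml.safe_load(doc)
print(records[0]["name"])  # -> Ada
```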
I have been thinking on that problem for a while. I came up with a simple format that could work very well for your use case: JTable.
If you wish, you can find a complete specification of the JTable format, with details and resources. But it's pretty self-explanatory, and any programmer would know how to handle it; really, the only thing you need to say is that it's JSON.
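The actual JTable specification isn't reproduced in this answer, but a hypothetical sketch of the general idea (column names sent once, rows as plain JSON arrays) might look like this:

```python
import json

# Hypothetical layout, not the official JTable spec: column names are
# sent once instead of being repeated on every record as in plain JSON.
payload = json.dumps({
    "columns": ["id", "name", "score"],
    "rows": [
        [1, "Ada", 92],
        [2, "Linus", 87],
    ],
})

# Any JSON parser can consume it; zipping restores per-record dicts.
table = json.loads(payload)
records = [dict(zip(table["columns"], row)) for row in table["rows"]]
print(records[1]["name"])  # -> Linus
```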
Looking through the existing answers, most of them struck me as a bit dated. Especially in terms of 'big data', noteworthy alternatives to CSV include:
ORC: 'Optimized Row Columnar' is in fact columnar storage (values are laid out column by column within stripes of rows), useful in Python/Pandas. Originated in Hive, optimised by Hortonworks. The schema lives in the footer. The Wikipedia entry is currently quite terse (https://en.wikipedia.org/wiki/Apache_ORC), but Apache has a lot of detail.
Parquet: Also column-based, with similar compression. Often used with Cloudera Impala.
Avro: From Apache Hadoop. Row-based, but uses a JSON schema. Pandas support is weaker. Often found in Apache Kafka clusters.
All are splittable, all are binary rather than human-readable, all describe their content with a schema, and all work with Hadoop. The column-based formats are considered best where accumulated data are read often; for write-heavy workloads, Avro may be better suited. See e.g. https://www.datanami.com/2018/05/16/big-data-file-formats-demystified/
Compression of the column formats can use SNAPPY (faster) or GZIP (slower but more compression).
You may also want to look into Protocol Buffers, Pickle (Python-specific) and Feather (for fast communication between Python and R).
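As a quick illustration of the columnar formats, a sketch writing and reading Parquet via pandas (assumes pandas plus a Parquet engine such as pyarrow is installed; the file name and data are made up):

```python
import pandas as pd  # requires a Parquet engine such as pyarrow

df = pd.DataFrame({"id": [1, 2], "name": ["Ada", "Linus"], "score": [92, 87]})

# "snappy" trades compression ratio for speed; "gzip" is smaller but slower.
df.to_parquet("scores.parquet", compression="snappy")

# The schema travels inside the file, so no separate column definition is needed.
restored = pd.read_parquet("scores.parquet")
print(restored.dtypes)
```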