Alternatives to CSV?

Published 2024-09-26 08:22:00

I intend to build a RESTful service which will return a custom text format. Given my very large volumes of data, XML/JSON is too verbose. I'm looking for a row based text format.

CSV is an obvious candidate. I'm however wondering if there isn't something better out there. The only candidates I've found through a bit of research are CTX and Fielded Text (http://www.fieldedtext.org/).

I'm looking for a format which offers the following:

  • Plain text, easy to read
  • very easy to parse by most software platforms
  • column definition can change without requiring changes in software clients

Fielded text is looking pretty good and I could definitely build a specification myself, but I'm curious to know what others have done given that this must be a very old problem. It's surprising that there isn't a better standard out there.

What suggestions do you have?

Comments (5)

世界如花海般美丽 2024-10-03 08:22:00

I'm sure you've already considered this, but I'm a fan of tab-delimited files (\t between fields, newline at the end of each row)
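A minimal sketch of this approach using only Python's standard library: the `csv` module handles tab-delimited text by just switching the delimiter, and a header row lets clients discover columns at runtime.

```python
import csv
import io

rows = [
    {"id": 1, "name": "alice", "score": 9.5},
    {"id": 2, "name": "bob", "score": 7.2},
]

# Write tab-delimited text: \t between fields, a newline at the end of each row.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name", "score"], delimiter="\t")
writer.writeheader()
writer.writerows(rows)
text = buf.getvalue()

# Parse it back; the header row means clients don't hard-code column order.
parsed = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
print(parsed[0]["name"])  # alice
```

Note that `DictReader` returns every field as a string, so numeric columns still need an explicit conversion on the client side.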

陌路终见情 2024-10-03 08:22:00

I would say that since CSV is the standard, and since everyone under the sun can parse it, use it.

If I were in your situation, I would take the bandwidth hit and use GZIP+XML, just because it's so darn easy to use.

And, on that note, you could always require that your users support GZIP and just send it as XML/JSON, since that should do a pretty good job of removing the redundancy across the wire.
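To illustrate the point, here is a small stdlib-only sketch: a repetitive row-based JSON payload (repeated keys are exactly the redundancy GZIP eats) compressed and round-tripped.

```python
import gzip
import json

# A repetitive row-based payload, like the question describes.
payload = [{"col1": "aaa", "col2": "xxx", "col3": i} for i in range(1000)]

raw = json.dumps(payload).encode("utf-8")
compressed = gzip.compress(raw)

# Repeated keys compress very well, so JSON's verbosity mostly disappears
# on the wire.
print(len(raw), len(compressed))

# The client just decompresses and parses as usual.
restored = json.loads(gzip.decompress(compressed))
```

In a real RESTful service you would typically not call `gzip` yourself; setting `Content-Encoding: gzip` (or letting the web server negotiate it via `Accept-Encoding`) achieves the same thing transparently.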

淡忘如思 2024-10-03 08:22:00

You could try YAML, its overhead is relatively small compared to formats such as XML or JSON.

Examples here: http://www.yaml.org/

Surprisingly, the website's text itself is YAML.
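A quick sketch of row-based data in YAML's compact flow style, parsed with PyYAML (a third-party package, assumed installed via `pip install pyyaml`):

```python
import yaml  # PyYAML; assumed installed (pip install pyyaml)

doc = """\
- {name: aaa, value: 1}
- {name: bbb, value: 2}
- {name: ccc, value: 3}
"""

# safe_load parses untrusted YAML without executing arbitrary tags.
rows = yaml.safe_load(doc)
print(rows[1]["name"])  # bbb
```

The flow style above is barely more verbose than CSV while still carrying field names on every row.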

未央 2024-10-03 08:22:00

I have been thinking on that problem for a while. I came up with a simple format that could work very well for your use case: JTable.

 {
    "header": ["Column1", "Column2", "Column3"],
    "rows"  : [
                ["aaa", "xxx", 1],
                ["bbb", "yyy", 2],
                ["ccc", "zzz", 3]
              ]
  }

If you wish, you can find a complete specification of the JTable format, with details and resources. But it is pretty self-explanatory, and any programmer will know how to handle it. Really, the only thing you need to tell clients is that it's JSON.
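A sketch of consuming this shape with nothing but a JSON parser: zip each row with the header, and new columns show up in clients automatically, which is exactly the "column definition can change" requirement from the question.

```python
import json

jtable = """
{
    "header": ["Column1", "Column2", "Column3"],
    "rows": [
        ["aaa", "xxx", 1],
        ["bbb", "yyy", 2],
        ["ccc", "zzz", 3]
    ]
}
"""

data = json.loads(jtable)

# Zip each row with the header to get records; clients never hard-code
# column positions, so the server can add columns freely.
records = [dict(zip(data["header"], row)) for row in data["rows"]]
print(records[0])  # {'Column1': 'aaa', 'Column2': 'xxx', 'Column3': 1}
```

Because the header is sent once instead of repeated per row (as in a plain JSON array of objects), the per-row overhead stays close to CSV's.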

忆悲凉 2024-10-03 08:22:00

Looking through the existing answers, most struck me as a bit dated. Especially in terms of 'big data', noteworthy alternatives to CSV include:

  • ORC : 'Optimized Row Columnar' is a column-oriented format (rows are grouped into stripes and stored column-wise), usable from Python/Pandas. Originated in Hive, optimised by Hortonworks. The schema is stored in the file footer. The Wikipedia entry is currently quite terse (https://en.wikipedia.org/wiki/Apache_ORC) but Apache has a lot of detail.

  • Parquet : Similarly column-based, with similar compression. Often used with Cloudera Impala.

  • Avro : from Apache Hadoop. Row-based, but carries a JSON-defined schema. Less capable support in Pandas. Often found in Apache Kafka clusters.

All are splittable, none are meant to be read by people, all describe their content with a schema, and all work with Hadoop. The column-based formats are considered best where accumulated data are read often; for write-heavy workloads, Avro may be better suited. See e.g. https://www.datanami.com/2018/05/16/big-data-file-formats-demystified/

Compression of the column formats can use SNAPPY (faster) or GZIP (slower but more compression).

You may also want to look into Protocol Buffers, Pickle (Python-specific) and Feather (for fast communication between Python and R).
