Good conventions for embedding a schema in flat files

Posted on 2024-08-26 04:47:23

We receive lots of data as flat files, either delimited or fixed-length records. It's sometimes hard to find out what the files actually contain.

Are there any well-established practices for embedding a file's schema at the beginning or end of the file, so that the file is self-describing?

Just to get an idea, imagine something like this:

<data name="test" records="2" type="fixed">
   <field name="foo" start="0" length="2" type="numeric"/>
   <field name="bar" start="2" length="4" type="text"/>
</data>
11test
12ing 

We would parse the XML at the beginning and use it to read the records.
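
To make the idea concrete, here is a minimal Python sketch of such a reader. It is only an illustration under assumptions that are not part of the question: the header must be well-formed XML (quoted attribute values, self-closing `field` elements), the header ends at the first `</data>`, each record sits on its own line, and the function name `read_self_describing` is made up for this example.

```python
import xml.etree.ElementTree as ET

# Sample file content: XML schema header followed by fixed-length records.
SAMPLE = (
    '<data name="test" records="2" type="fixed">\n'
    '   <field name="foo" start="0" length="2" type="numeric"/>\n'
    '   <field name="bar" start="2" length="4" type="text"/>\n'
    '</data>\n'
    '11test\n'
    '12ing \n'
)


def read_self_describing(text):
    """Parse the XML header, then slice each record line by the declared fields."""
    end_tag = "</data>"
    # Assumption: the first "</data>" marks the end of the schema header.
    header_end = text.index(end_tag) + len(end_tag)
    schema = ET.fromstring(text[:header_end])
    fields = [
        (f.get("name"), int(f.get("start")), int(f.get("length")), f.get("type"))
        for f in schema.findall("field")
    ]

    rows = []
    for line in text[header_end:].splitlines():
        if not line:
            continue  # skip the empty remainder of the header's last line
        row = {}
        for name, start, length, ftype in fields:
            raw = line[start:start + length]
            # "numeric" becomes int; text fields lose their right padding.
            row[name] = int(raw) if ftype == "numeric" else raw.rstrip()
        rows.append(row)

    # Use the declared record count as a cheap integrity check.
    expected = int(schema.get("records"))
    if len(rows) != expected:
        raise ValueError(f"header declares {expected} records, found {len(rows)}")
    return rows


if __name__ == "__main__":
    for row in read_self_describing(SAMPLE):
        print(row)
```

Checking the `records` attribute against the number of lines actually read is optional, but it catches truncated files early.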

Comments (3)

一页 2024-09-02 04:47:23

As far as I'm aware, no, or at least nothing in wide use.

The only widely accepted convention I know of is to make the first row of the data file the column names, at least for delimited records. For fixed-length records it's harder, especially if the data can contain multiple record types (which I've found to be far more likely with fixed-length than with delimited files).
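
For the delimited case, the header-row convention is what most CSV tooling already expects. A small Python sketch using the standard `csv` module follows; the sample data and the helper name `read_with_header` are made up for illustration.

```python
import csv
import io

# Delimited data whose first row carries the column names.
SAMPLE = "foo,bar\n11,test\n12,ing\n"


def read_with_header(text, delimiter=","):
    """Return (column names, rows as dicts), taking the names from the first row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    names = reader.fieldnames  # reading this consumes the header row
    return names, [dict(row) for row in reader]


if __name__ == "__main__":
    names, rows = read_with_header(SAMPLE)
    print(names)
    print(rows)
```

Note that a header row only gives names: every value comes back as a string, so types and widths still have to be agreed on elsewhere.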

From where I sit, I'd suggest that you can't really embed the definition in the file. I'm assuming you're getting the data from external sources, so you're unlikely to get help from them, and even if you do, you immediately create new challenges: for example, you can no longer easily open the files with Excel if necessary.

Thinking a bit laterally, you could, if using XML, embed the file in the definition instead (as one big lump of CDATA). This is slightly more practical because it puts a wrapper around your external data rather than asking for the data itself to be modified. I'm not sure how practical this is, but it feels better to me than the other way round.
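
A rough Python sketch of that wrapper idea is below. The element names (`data`, `field`, `records`) and the helpers `wrap` and `unwrap` are invented for this example, not an established format. Two caveats I'm sure of: a CDATA section cannot contain the sequence `]]>`, and XML parsers normalize `\r\n` line endings to `\n`, so the round trip is not byte-exact for every file.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def wrap(raw, name, fields):
    """Wrap untouched file content in an XML definition, carried as CDATA.

    fields is a list of (name, start, length, type) tuples.
    """
    if "]]>" in raw:
        # A CDATA section ends at the first "]]>", so such data cannot be wrapped as-is.
        raise ValueError("data contains ']]>' and cannot go into a single CDATA section")
    lines = [f'<data name={quoteattr(name)} type="fixed">']
    for fname, start, length, ftype in fields:
        lines.append(
            f'  <field name={quoteattr(fname)} start="{start}" '
            f'length="{length}" type={quoteattr(ftype)}/>'
        )
    lines.append(f"  <records><![CDATA[{raw}]]></records>")
    lines.append("</data>")
    return "\n".join(lines)


def unwrap(document):
    """Return the original file content from a wrapped document."""
    root = ET.fromstring(document)
    return root.find("records").text


if __name__ == "__main__":
    raw = "11test\n12ing \n"
    doc = wrap(raw, "test", [("foo", 0, 2, "numeric"), ("bar", 2, 4, "text")])
    print(doc)
    print(repr(unwrap(doc)))
```

The supplier's file stays exactly as delivered; the wrapper is something the receiving side can add on arrival.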

旧伤还要旧人安 2024-09-02 04:47:23

Have you looked at Protocol Buffers for inspiration?

蓝天 2024-09-02 04:47:23

I don't know about any established practice, but your idea of just prepending the schema to the data seems fine. Apache Avro is a data serialization tool similar to Protocol Buffers and Thrift. I believe typical Avro usage involves storing the schema with the data (by prepending it in the stream, I'd guess).
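
To illustrate the "schema travels with the data" idea without pulling in any library, here is a toy Python sketch. It is not Avro's actual file format or API; the layout (one JSON schema line followed by comma-delimited rows), the type names in `CASTS`, and the helpers `dump` and `load` are all assumptions made up for this example.

```python
import csv
import io
import json

# Hypothetical type names mapped to Python converters.
CASTS = {"int": int, "string": str}


def dump(schema, rows):
    """Write the schema as a single JSON line, then the rows as delimited text."""
    out = io.StringIO()
    out.write(json.dumps(schema) + "\n")
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        writer.writerow([row[field["name"]] for field in schema["fields"]])
    return out.getvalue()


def load(text):
    """Read the schema from the first line and use it to name and type each value."""
    header, _, body = text.partition("\n")
    schema = json.loads(header)
    fields = schema["fields"]
    rows = []
    for values in csv.reader(io.StringIO(body)):
        rows.append(
            {f["name"]: CASTS[f["type"]](v) for f, v in zip(fields, values)}
        )
    return schema, rows


if __name__ == "__main__":
    schema = {
        "name": "test",
        "fields": [
            {"name": "foo", "type": "int"},
            {"name": "bar", "type": "string"},
        ],
    }
    text = dump(schema, [{"foo": 11, "bar": "test"}, {"foo": 12, "bar": "ing"}])
    print(text)
    print(load(text))
```

Keeping the schema on exactly one line makes it easy for a reader to peel it off before handing the rest to an ordinary delimited-file parser.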

I wanted to also mention the PADS project. They have a schema language designed to let you describe "ad-hoc" data formats. Currently I believe they only have C and ML implementations, which may be a problem. On the other hand, their schema language was designed to handle a wide variety of formats, so it still might be worth using it over your own XML-based thing.
