Reading binary data (from a file) into a struct

Posted on 2024-09-28 18:47:39

I'm reading binary data from a file, specifically from a zip file. (For more about the zip format structure, see http://en.wikipedia.org/wiki/ZIP_%28file_format%29)

I've created a struct that stores the data:

typedef struct {
                                            /*Start Size            Description                                 */
    int signature;                          /*   0  4   Local file header signature = 0x04034b50                */
    short int version;                      /*   4  2   Version needed to extract (minimum)                     */
    short int bit_flag;                     /*   6  2   General purpose bit flag                                */
    short int compression_method;           /*   8  2   Compression method                                      */
    short int time;                         /*  10  2   File last modification time                             */
    short int date;                         /*  12  2   File last modification date                             */
    int crc;                                /*  14  4   CRC-32                                                  */
    int compressed_size;                    /*  18  4   Compressed size                                         */
    int uncompressed_size;                  /*  22  4   Uncompressed size                                       */
    short int name_length;                  /*  26  2   File name length (n)                                    */
    short int extra_field_length;           /*  28  2   Extra field length (m)                                  */
    char *name;                             /*  30  n   File name                                               */
    char *extra_field;                      /*30+n  m   Extra field                                             */

} ZIP_local_file_header;

sizeof(ZIP_local_file_header) returns 40, but summing the sizeof of each field gives a total of only 38.

If we have the next struct:

typedef struct {
    short int x;
    int y;
} FOO;

sizeof(FOO) returns 8 because memory is allocated 4 bytes at a time. So 4 bytes are reserved for x (although its real size is 2 bytes). If the next member were another short int, it would fill the remaining 2 bytes of that allocation. But since the next member is an int, another 4 bytes are allocated and the 2 empty bytes are wasted.

To read data from file, we can use the function fread:

ZIP_local_file_header p;
fread(&p,sizeof(ZIP_local_file_header),1,file);

But because there are empty bytes in the middle of the struct, the data isn't read correctly.

What can I do to store the data sequentially and efficiently in ZIP_local_file_header without wasting any bytes?

清音悠歌 2024-10-05 18:47:39

In order to meet the alignment requirements of the underlying platform, structs may have "padding" bytes between members so that each member starts at a properly aligned address.
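
A minimal sketch of how the padding can be observed, using the FOO struct from the question. The helper names foo_padding_bytes and foo_offset_of_y are hypothetical, and the numbers they return depend on the compiler and platform.

```c
#include <assert.h>
#include <stddef.h>   /* offsetof, size_t */

typedef struct {
    short int x;
    int y;
} FOO;

/* Hypothetical helper: number of bytes in FOO that belong to no member. */
size_t foo_padding_bytes(void)
{
    size_t members = sizeof(short int) + sizeof(int);

    /* A struct is never smaller than the sum of its members. */
    assert(sizeof(FOO) >= members);
    return sizeof(FOO) - members;
}

/* Hypothetical helper: the offset at which y actually starts inside FOO. */
size_t foo_offset_of_y(void)
{
    return offsetof(FOO, y);
}
```

On a typical platform with a 2-byte short and a 4-byte int aligned to 4 bytes, y starts at offset 4, so two padding bytes follow x and sizeof(FOO) is 8.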

There are several ways around this: one is to read each element of the header separately using the appropriately-sized member:

fread(&p.signature, sizeof p.signature, 1, file);
fread(&p.version, sizeof p.version, 1, file);
...
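
Spelled out in full, the member-by-member approach might look like the sketch below. It assumes the fixed part of the header is declared with <stdint.h> fixed-width types rather than the int and short int of the question; the names zip_fixed_header and read_fixed_header are hypothetical. The bytes are copied as-is, so the values come out right only on a host whose byte order matches the file's (ZIP stores multi-byte fields least significant byte first).

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Fixed part of the local file header (assumed fixed-width types). */
typedef struct {
    uint32_t signature;
    uint16_t version;
    uint16_t bit_flag;
    uint16_t compression_method;
    uint16_t time;
    uint16_t date;
    uint32_t crc;
    uint32_t compressed_size;
    uint32_t uncompressed_size;
    uint16_t name_length;
    uint16_t extra_field_length;
} zip_fixed_header;

/* Hypothetical helper: reads the 30 fixed bytes one member at a time, so
   padding inside the struct never matters. Returns 1 on success and 0 on
   a short read. */
int read_fixed_header(FILE *file, zip_fixed_header *p)
{
    assert(file != NULL && p != NULL);

    /* && short-circuits, so reading stops at the first failed fread. */
    return fread(&p->signature,          sizeof p->signature,          1, file) == 1
        && fread(&p->version,            sizeof p->version,            1, file) == 1
        && fread(&p->bit_flag,           sizeof p->bit_flag,           1, file) == 1
        && fread(&p->compression_method, sizeof p->compression_method, 1, file) == 1
        && fread(&p->time,               sizeof p->time,               1, file) == 1
        && fread(&p->date,               sizeof p->date,               1, file) == 1
        && fread(&p->crc,                sizeof p->crc,                1, file) == 1
        && fread(&p->compressed_size,    sizeof p->compressed_size,    1, file) == 1
        && fread(&p->uncompressed_size,  sizeof p->uncompressed_size,  1, file) == 1
        && fread(&p->name_length,        sizeof p->name_length,        1, file) == 1
        && fread(&p->extra_field_length, sizeof p->extra_field_length, 1, file) == 1;
}
```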

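Another is to use bit fields in your struct definition; adjacent bit fields are packed together rather than aligned individually, although their exact layout is implementation-defined, so check the result with sizeof on your compiler. The downside is that bit fields must be unsigned int or int or, as of C99, _Bool; you may have to cast the raw data to the target type to interpret it correctly: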

typedef struct {                 
    unsigned int signature          : 32;
    unsigned int version            : 16;                
    unsigned int bit_flag           : 16;
    unsigned int compression_method : 16;              
    unsigned int time               : 16;
    unsigned int date               : 16;
    unsigned int crc                : 32;
    unsigned int compressed_size    : 32;                 
    unsigned int uncompressed_size  : 32;
    unsigned int name_length        : 16;    
    unsigned int extra_field_length : 16;
} ZIP_local_file_header;

You may also have to byte-swap each member if the file's byte order differs from your system's, for example if the file was written big-endian but your system is little-endian.
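
A rough sketch of such byte-swapping for 16- and 32-bit members. All four function names are hypothetical, and le32_to_host assumes the value was read raw from a little-endian file, which is how ZIP stores its multi-byte fields.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the two bytes of a 16-bit value. */
uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v >> 8) | (v << 8));
}

/* Reverse the four bytes of a 32-bit value. */
uint32_t swap32(uint32_t v)
{
    return  (v >> 24)
         | ((v >>  8) & 0x0000FF00u)
         | ((v <<  8) & 0x00FF0000u)
         |  (v << 24);
}

/* Returns 1 when the host stores the least significant byte first. */
int host_is_little_endian(void)
{
    uint16_t probe = 1;
    return *(const unsigned char *)&probe == 1;
}

/* Converts a 32-bit value that was read raw from a little-endian file
   into the host's byte order. */
uint32_t le32_to_host(uint32_t raw)
{
    /* Sanity check: swapping twice must give the original value back. */
    assert(swap32(swap32(raw)) == raw);
    return host_is_little_endian() ? raw : swap32(raw);
}
```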

Note that name and extra field aren't part of the struct definition. When you read from the file, you won't be reading pointer values for the name and extra field; you'll be reading their actual contents. Since you don't know the sizes of those fields until you've read the rest of the header, defer reading them until after you've read the structure above. Something like:

ZIP_local_file_header p;
char *name = NULL;
char *extra = NULL;
...
/* Read the fixed part first; it holds the lengths of what follows. */
fread(&p, sizeof p, 1, file);
/* +1 leaves room for a terminating 0, which the file does not store. */
if ((name = malloc(p.name_length + 1)) != NULL)
{
    fread(name, p.name_length, 1, file);
    name[p.name_length] = 0;
}
if ((extra = malloc(p.extra_field_length + 1)) != NULL)
{
    fread(extra, p.extra_field_length, 1, file);
    extra[p.extra_field_length] = 0;
}

缘字诀 2024-10-05 18:47:39

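C structs just group related data together; they don't specify a particular layout in memory. (Just as the width of an int is not fixed either.) Little-endian versus big-endian byte order is not defined either and depends on the processor.

Different compilers, or the same compiler on a different architecture or operating system, will lay out structs differently.

Since the file format you want to read is defined in terms of which bytes go where, a struct, although it looks very convenient and tempting, is not the right solution. You need to treat the file as a char[], pull out the bytes you need, and shift them to assemble the numbers that are made up of multiple bytes, and so on.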
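
A minimal sketch of this byte-oriented approach for the 30 fixed bytes of the local file header, using the offsets listed in the question. The names get_le16, get_le32, zip_local_header and parse_local_header are hypothetical. Because each value is assembled with shifts, the result does not depend on the host's byte order or on struct padding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assemble little-endian integers from raw bytes. */
uint16_t get_le16(const unsigned char *b)
{
    return (uint16_t)(b[0] | (b[1] << 8));
}

uint32_t get_le32(const unsigned char *b)
{
    return  (uint32_t)b[0]
         | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16)
         | ((uint32_t)b[3] << 24);
}

/* In-memory form of the header; its layout need not match the file. */
typedef struct {
    uint32_t signature;
    uint16_t version, bit_flag, compression_method, time, date;
    uint32_t crc, compressed_size, uncompressed_size;
    uint16_t name_length, extra_field_length;
} zip_local_header;

/* Decodes the fixed header from buf. Returns 1 if at least 30 bytes are
   available and the signature is 0x04034b50, otherwise 0. */
int parse_local_header(const unsigned char *buf, size_t len, zip_local_header *h)
{
    assert(buf != NULL && h != NULL);
    if (len < 30)
        return 0;

    h->signature          = get_le32(buf + 0);
    h->version            = get_le16(buf + 4);
    h->bit_flag           = get_le16(buf + 6);
    h->compression_method = get_le16(buf + 8);
    h->time               = get_le16(buf + 10);
    h->date               = get_le16(buf + 12);
    h->crc                = get_le32(buf + 14);
    h->compressed_size    = get_le32(buf + 18);
    h->uncompressed_size  = get_le32(buf + 22);
    h->name_length        = get_le16(buf + 26);
    h->extra_field_length = get_le16(buf + 28);

    return h->signature == 0x04034b50u;
}
```

The caller would fread 30 bytes into an unsigned char array, pass it to parse_local_header, and then read name_length and extra_field_length further bytes for the variable-length fields.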

缱绻入梦 2024-10-05 18:47:39

The solution is compiler-specific, but in GCC, for instance, you can force it to pack the struct more tightly by appending __attribute__((packed)) to the definition. See http://gcc.gnu.org/onlinedocs/gcc-3.2.3/gcc/Type-Attributes.html.
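
A sketch of what that could look like for the fixed part of the header, assuming GCC or Clang and <stdint.h> fixed-width members. The names packed_zip_header and read_packed_header are hypothetical. The bytes still arrive in the file's byte order, so the values are only correct as read on a little-endian host.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* __attribute__((packed)) is a GCC/Clang extension, not standard C. */
typedef struct {
    uint32_t signature;
    uint16_t version;
    uint16_t bit_flag;
    uint16_t compression_method;
    uint16_t time;
    uint16_t date;
    uint32_t crc;
    uint32_t compressed_size;
    uint32_t uncompressed_size;
    uint16_t name_length;
    uint16_t extra_field_length;
} __attribute__((packed)) packed_zip_header;

/* Hypothetical helper: reads the 30 fixed header bytes with one fread.
   Returns 1 on success and 0 on a short read. */
int read_packed_header(FILE *file, packed_zip_header *p)
{
    assert(file != NULL && p != NULL);

    /* With the attribute in effect, no padding is inserted. */
    assert(sizeof *p == 30);
    return fread(p, sizeof *p, 1, file) == 1;
}
```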

中性美 2024-10-05 18:47:39

It's been a while since I worked with zip-compressed files, but I do remember the practice of adding my own padding to hit the 4-byte alignment rules of PowerPC arch.

In the best case, you only need to define each member of your struct with the exact size of the piece of data you want to read in. Don't just use 'int', because its size varies between platforms and compilers.

Do something like this in a header:

#include <stdint.h>

/* unsigned long is 8 bytes on many 64-bit platforms, so use exact widths. */
typedef uint32_t unsigned32;
typedef uint16_t unsigned16;
typedef uint8_t  unsigned8;
typedef uint8_t  byte;

Then, instead of just int, use unsigned32 wherever you have a known 4-byte value, and unsigned16 for any known 2-byte value.

This will help you see where you can add padding bytes to hit 4-byte alignment, or where you can group two 2-byte elements to fill one 4-byte slot.

Ideally you can use a minimum of padding bytes (which can be used to add additional data later as you expand the program), or none at all if you can align everything to 4-byte boundaries with the variable-length data at the end.
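
A small illustration of the grouping idea, with hypothetical struct names grouped and scattered and a hypothetical helper padding_in. It assumes <stdint.h> fixed-width types; how much padding the compiler actually inserts depends on the platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Two 2-byte members placed together fill one 4-byte slot. */
typedef struct {
    uint16_t a;
    uint16_t b;
    uint32_t c;
} grouped;

/* The same members with the 4-byte one in the middle. */
typedef struct {
    uint16_t a;
    uint32_t c;
    uint16_t b;
} scattered;

/* Hypothetical helper: padding bytes in a struct of the given size whose
   members are two uint16_t and one uint32_t. */
size_t padding_in(size_t struct_size)
{
    size_t members = 2 * sizeof(uint16_t) + sizeof(uint32_t);

    assert(struct_size >= members);
    return struct_size - members;
}
```

On a typical ABI where uint32_t is aligned to 4 bytes, padding_in(sizeof(grouped)) is 0 while padding_in(sizeof(scattered)) is 4. Note that the ZIP local file header cannot be laid out this way by reordering alone, since its CRC-32 field sits at offset 14.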
