How to parse a tar file in C++

Posted 2024-08-26 11:34:46

What I want to do is download a .tar file with multiple directories with 2 files each. The problem is I can't find a way to read the tar file without actually extracting the files (using tar).

The perfect solution would be something like:

#include <easytar>

Tarfile tar("somefile.tar");
std::string currentFile, currentFileName;
for(int i=0; i<tar.size(); i++){
  currentFile = tar.getFileText(i);
  currentFileName = tar.getFileName(i);
  // do stuff with it
}

I'm probably going to have to write this myself, but any ideas would be appreciated..

Comments (3)

十秒萌定你 2024-09-02 11:34:46

I figured this out myself after a bit of work. The tar file spec actually tells you everything you need to know.

First off, every file starts with a 512 byte header, so you can represent it with a char[512] or a char* pointing at somewhere in your larger char array (if you have the entire file loaded into one array for example).

The header looks like this:

location  size  field
0         100   File name
100       8     File mode
108       8     Owner's numeric user ID
116       8     Group's numeric user ID
124       12    File size in bytes
136       12    Last modification time in numeric Unix time format
148       8     Checksum for header block
156       1     Link indicator (file type)
157       100   Name of linked file
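
The layout in the table can be mirrored as a plain struct, which lets the compiler check the offsets. This is only a sketch: the name TarHeader and the trailing padding member are assumptions made for illustration, and the POSIX ustar format stores further fields in the area that is treated as padding here.

```cpp
#include <cstddef>

// Hypothetical struct mirroring the header table; every member is a char
// array (or a single char), so the compiler inserts no padding of its own.
struct TarHeader {
    char name[100];      // offset 0:   file name
    char mode[8];        // offset 100: file mode
    char uid[8];         // offset 108: owner's numeric user ID
    char gid[8];         // offset 116: group's numeric user ID
    char size[12];       // offset 124: file size in bytes
    char mtime[12];      // offset 136: last modification time
    char checksum[8];    // offset 148: checksum for header block
    char typeflag;       // offset 156: link indicator (file type)
    char linkname[100];  // offset 157: name of linked file
    char padding[255];   // offset 257: rest of the 512-byte block (unused in this sketch)
};

// Compile-time checks that the struct matches the table.
static_assert(sizeof(TarHeader) == 512, "a tar header is one 512-byte block");
static_assert(offsetof(TarHeader, size) == 124, "size field offset");
static_assert(offsetof(TarHeader, typeflag) == 156, "link indicator offset");
```

With this in place, a 512-byte block can be copied into a TarHeader with memcpy and the fields read by name instead of by raw offset.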

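So if you want the file name, you can grab it with std::string filename(&buffer[0], 100); (note the &: the constructor needs a pointer to the field, not the single char buffer[0]). The file name is NUL-padded, so that call keeps the trailing NULs in the string. If you first check that the field contains at least one NUL, you can leave off the size and use std::string filename(&buffer[0]); instead.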
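
One way to handle both cases (a name shorter than 100 bytes, and a name that fills the field with no terminating NUL) is a small helper that stops at the first NUL or at the field width, whichever comes first. The helper name field_to_string is an assumption made for this sketch.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: copies a fixed-width header field up to the first
// NUL byte, or the whole field when it contains no NUL at all.
std::string field_to_string(const char* field, std::size_t width) {
    std::size_t len = 0;
    while (len < width && field[len] != '\0') {
        ++len;
    }
    return std::string(field, len);
}
```

The file name would then be field_to_string(&buffer[0], 100), and the linked-file name field_to_string(&buffer[157], 100).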

Now we want to know if it's a file or a folder. The "link indicator" field has this information, so:

// Note that we're comparing to ascii numbers, not ints
switch(buffer[156]){
    case '0': // intentionally dropping through
    case '\0':
        // normal file
        break;
    case '1':
        // hard link
        break;
    case '2':
        // symbolic link
        break;
    case '3':
        // character device (special file)
        break;
    case '4':
        // block device
        break;
    case '5':
        // directory
        break;
    case '6':
        // named pipe
        break;
}

At this point, we already have all of the information we need about directories, but we need one more thing from normal files: the actual file contents.

The length of the file can be stored in two different ways, either as a 0-or-space-padded null-terminated octal string, or "a base-256 coding that is indicated by setting the high-order bit of the leftmost byte of a numeric field".

Numeric values are encoded in octal numbers using ASCII digits, with leading zeroes. For historical reasons, a final NUL or space character should be used. Thus although there are 12 bytes reserved for storing the file size, only 11 octal digits can be stored. This gives a maximum file size of 8 gigabytes on archived files. To overcome this limitation, star in 2001 introduced a base-256 coding that is indicated by setting the high-order bit of the leftmost byte of a numeric field. GNU-tar and BSD-tar followed this idea. Additionally, versions of tar from before the first POSIX standard from 1988 pad the values with spaces instead of zeroes.

Here's how you would read the octal format, but I haven't written code for the base-256 version:

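// in one function
unsigned long long size_of_file = octal_string_to_int(&buffer[124], 11);

// elsewhere (declared before the call above)
// Skips leading spaces, then reads octal digits until the field ends or a
// non-digit (the terminating NUL or space) is reached.
unsigned long long octal_string_to_int(const char *current_char, unsigned int size){
    unsigned long long output = 0; // 11 octal digits need 33 bits, too many for a 32-bit int
    while(size > 0 && *current_char == ' '){
        current_char++;
        size--;
    }
    while(size > 0 && *current_char >= '0' && *current_char <= '7'){
        output = output * 8 + (*current_char - '0');
        current_char++;
        size--;
    }
    return output;
}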
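
For the base-256 case, a rough sketch could look like the following. It assumes the field is a big-endian binary number whose leftmost byte carries the marker bit, as the quoted description says. The function name parse_base256 is hypothetical, and rejecting values with bit 0x40 set (which some implementations appear to treat as a sign bit) is an assumption of this sketch rather than something the description above specifies.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: decodes a base-256 numeric field into `out`.
// Returns false if the field is not base-256 encoded, or if the value
// cannot be represented by this sketch (sign bit set, or more than 64 bits).
bool parse_base256(const unsigned char* field, std::size_t width, std::uint64_t& out) {
    if (width == 0 || (field[0] & 0x80) == 0) {
        return false;  // high-order bit clear: this is an octal field
    }
    if (field[0] & 0x40) {
        return false;  // assumption: treated as a negative value, not handled here
    }
    std::uint64_t value = field[0] & 0x3F;  // remaining bits of the first byte
    for (std::size_t i = 1; i < width; ++i) {
        if (value >> 56) {
            return false;  // shifting by 8 more bits would overflow 64 bits
        }
        value = (value << 8) | field[i];
    }
    out = value;
    return true;
}
```

Since the buffer in this answer is a char array, the call would need a cast, for example parse_base256(reinterpret_cast<const unsigned char*>(&buffer[124]), 12, size), falling back to the octal parser when it returns false.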

Ok, so now we have everything except the actual file contents. All we have to do is grab the next size bytes of data from the tar file and we'll have our file contents:

// Get to the next block after the header ends
location += 512;
file_contents = new char[size];
memcpy(file_contents, &buffer[location], size);
// Go to the next header by rounding the size up to a multiple of 512
// This isn't necessarily the most efficient way to do this,
// but it's the most obvious. (ceil needs <cmath>.)
location += (int)ceil(size / 512.0) * 512;
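
Putting the steps together, a minimal sketch of walking a whole archive that has already been loaded into memory might look like this. The names TarEntry, parse_octal and read_tar are hypothetical. The sketch only handles octal size fields, treats a header whose name starts with a NUL byte as the end of the archive, and stops quietly on truncated input; a real reader would also want to verify the checksum and handle the base-256 size encoding.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record for one archive member.
struct TarEntry {
    std::string name;      // file name from offset 0
    char type;             // link indicator from offset 156
    std::string contents;  // raw file bytes (empty for directories)
};

// Reads an octal field: skips leading spaces, stops at the first non-digit.
static std::uint64_t parse_octal(const char* p, std::size_t n) {
    std::uint64_t value = 0;
    std::size_t i = 0;
    while (i < n && p[i] == ' ') ++i;
    for (; i < n && p[i] >= '0' && p[i] <= '7'; ++i) {
        value = value * 8 + static_cast<std::uint64_t>(p[i] - '0');
    }
    return value;
}

// Walks every header in `buffer` and collects the entries it describes.
std::vector<TarEntry> read_tar(const std::vector<char>& buffer) {
    std::vector<TarEntry> entries;
    std::size_t location = 0;
    while (location + 512 <= buffer.size()) {
        const char* header = &buffer[location];
        if (header[0] == '\0') break;  // zero-filled block: end of archive

        std::size_t name_len = 0;
        while (name_len < 100 && header[name_len] != '\0') ++name_len;
        const std::uint64_t size = parse_octal(header + 124, 12);

        location += 512;  // the contents start in the block after the header
        if (size > buffer.size() - location) break;  // truncated archive

        TarEntry entry;
        entry.name.assign(header, name_len);
        entry.type = header[156];
        entry.contents.assign(buffer.data() + location,
                              static_cast<std::size_t>(size));
        entries.push_back(entry);

        // Skip the contents, rounded up to a whole number of 512-byte blocks.
        location += static_cast<std::size_t>((size + 511) / 512 * 512);
    }
    return entries;
}
```

This gives roughly the interface asked for in the question: entries.size(), entries[i].name and entries[i].contents, without extracting anything to disk.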

无名指的心愿 2024-09-02 11:34:46

Have you looked at libtar?

From the fink package info:

libtar-1.2-1: Tar file manipulation API
libtar is a C library for manipulating POSIX tar files. It handles
adding and extracting files to/from a tar archive.
libtar offers the following features:
* Flexible API - you can manipulate individual files or just
extract a whole archive at once.
* Allows user-specified read() and write() functions, such as
zlib's gzread() and gzwrite().
* Supports both POSIX 1003.1-1990 and GNU tar file formats.

Not C++ per se, but you can link to C pretty easily...

要走干脆点 2024-09-02 11:34:46

libarchive is an open-source library that can parse tarballs. It can read each file from an archive without extracting it, and it can also write data to form a new archive.
