如何找到“中央目录”的开头?在 zip 文件中?

发布于 2024-10-14 21:50:15 字数 316 浏览 7 评论 0 原文

维基百科对 ZIP 文件格式有很好的描述,但是“中心目录”结构让我感到困惑。具体来说是这样的:

此顺序允许一次创建 ZIP 文件,但通常通过首先读取最后的中心目录来解压缩。

问题在于,即使是中央目录的尾随标头也是可变长度的。那么,如何才能获得要解析的中央目录的开头呢?

(哦,在来这里询问之前,我确实花了一些时间徒劳地查看 APPNOTE.TXT :P)

Wikipedia has an excellent description of the ZIP file format, but the "central directory" structure is confusing to me. Specifically this:

This ordering allows a ZIP file to be created in one pass, but it is usually decompressed by first reading the central directory at the end.

The problem is that even the trailing header for the central directory is variable length. How then, can someone get the start of the central directory to parse?

(Oh, and I did spend some time looking at APPNOTE.TXT in vain before coming here and asking :P)

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(5

执手闯天涯 2024-10-21 21:50:15

我的哀悼,阅读维基百科的描述给我一个非常强烈的印象,你需要做大量的猜测+检查工作:

从末尾向后寻找 0x06054b50 目录结束标记,向前查找 16 个字节以找到偏移量目录起始标记 0x02014b50,希望就是这样。您可以进行一些健全性检查,例如在目录结束标记后查找注释长度和注释字符串标记,但感觉 Zip 解码器确实有效,因为人们不会将有趣的字符放入 zip 注释、文件名等中四。无论如何,完全基于维基百科页面。

My condolences, reading the wikipedia description gives me the very strong impression that you need to do a fair amount of guess + check work:

Hunt backwards from the end for the 0x06054b50 end-of-directory tag, look forward 16 bytes to find the offset for the start-of-directory tag 0x02014b50, and hope that is it. You could do some sanity checks like looking for the comment length and comment string tags after the end-of-directory tag, but it sure feels like Zip decoders work because people don't put funny characters into their zip comments, filenames, and so forth. Based entirely on the wikipedia page, anyhow.

浅浅 2024-10-21 21:50:15

我前段时间正在实现 zip 存档支持,并且我在最后几千字节中搜索中央目录签名的末尾(4 个字节)。这效果非常好,直到有人将 50kb 文本放入注释中(这不太可能发生。绝对可以肯定,您可以搜索最后 64kb + 几个字节,因为注释大小是 16 位)。
之后,我查找中央目录定位器的 zip64 端,这更容易,因为它具有固定的结构。

I was implementing zip archive support some time ago, and I search last few kilobytes for a end of central directory signature (4 bytes). That works pretty good, until somebody will put 50kb text into comment (which is unlikely to happen. To be absolutely sure, you can search last 64kb + few bytes, since comment size is 16 bit).
After that, I look up for zip64 end of central dir locator, that's easier since it has fixed structure.

街道布景 2024-10-21 21:50:15

这是我必须推出的解决方案,以防有人需要它。这涉及到获取中央目录。

就我而言,我不想要任何 zip 解决方案中提供的任何压缩功能。我只是想了解一下内容。以下代码将返回 ZipArchive,其中包含 zip 中每个条目的列表。

它还使用最少量的文件访问和内存分配。

TinyZip.cpp

#include "TinyZip.h"
#include <cstdio>

namespace TinyZip
{
#define VALID_ZIP_SIGNATURE 0x04034b50
#define CENTRAL_DIRECTORY_EOCD 0x06054b50 //signature
#define CENTRAL_DIRECTORY_ENTRY_SIGNATURE 0x02014b50
#define PTR_OFFS(type, mem, offs) *((type*)(mem + offs)) //SHOULD BE OK 

    typedef struct {
        unsigned int signature : 32;
        unsigned int number_of_disk : 16;
        unsigned int disk_where_cd_starts : 16;
        unsigned int number_of_cd_records : 16;
        unsigned int total_number_of_cd_records : 16;
        unsigned int size_of_cd : 32;
        unsigned int offset_of_start : 32;
        unsigned int comment_length : 16;
    } ZipEOCD;

    ZipArchive* ZipArchive::GetArchive(const char *filepath)
    {
        FILE *pFile = nullptr;
#ifdef WIN32
        errno_t err;
        if ((err = fopen_s(&pFile, filepath, "rb")) == 0)
#else
        if ((pFile = fopen(filepath, "rb")) == NULL)
#endif
        {
            int fileSignature = 0;
            //Seek to start and read zip header
            fread(&fileSignature, sizeof(int), 1, pFile);
            if (fileSignature != VALID_ZIP_SIGNATURE) return false;

            //Grab the file size
            long fileSize = 0;
            long currPos = 0;

            fseek(pFile, 0L, SEEK_END);
            fileSize = ftell(pFile);
            fseek(pFile, 0L, SEEK_SET);

            //Step back the size of the ZipEOCD 
            //If it doesn't have any comments, should get an instant signature match
            currPos = fileSize;
            int signature = 0;
            while (currPos > 0)
            {
                fseek(pFile, currPos, SEEK_SET);
                fread(&signature, sizeof(int), 1, pFile);
                if (signature == CENTRAL_DIRECTORY_EOCD)
                {
                    break;
                }
                currPos -= sizeof(char); //step back one byte
            }

            if (currPos != 0)
            {
                ZipEOCD zipOECD;
                fseek(pFile, currPos, SEEK_SET);
                fread(&zipOECD, sizeof(ZipEOCD), 1, pFile);

                long memBlockSize = fileSize - zipOECD.offset_of_start;

                //Allocate zip archive of size
                ZipArchive *pArchive = new ZipArchive(memBlockSize);

                //Read in the whole central directory (also includes the ZipEOCD...)
                fseek(pFile, zipOECD.offset_of_start, SEEK_SET);
                fread((void*)pArchive->m_MemBlock, memBlockSize - 10, 1, pFile);
                long currMemBlockPos = 0;
                long currNullTerminatorPos = -1;
                while (currMemBlockPos < memBlockSize)
                {
                    int sig = PTR_OFFS(int, pArchive->m_MemBlock, currMemBlockPos);
                    if (sig != CENTRAL_DIRECTORY_ENTRY_SIGNATURE)
                    {
                        if (sig == CENTRAL_DIRECTORY_EOCD) return pArchive;
                        return nullptr; //something went wrong
                    }

                    if (currNullTerminatorPos > 0)
                    {
                        pArchive->m_MemBlock[currNullTerminatorPos] = '\0';
                        currNullTerminatorPos = -1;
                    }

                    const long offsToFilenameLen = 28;
                    const long offsToFieldLen = 30;
                    const long offsetToFilename = 46;

                    int filenameLength = PTR_OFFS(int, pArchive->m_MemBlock, currMemBlockPos + offsToFilenameLen);
                    int extraFieldLen = PTR_OFFS(int, pArchive->m_MemBlock, currMemBlockPos + offsToFieldLen);
                    const char *pFilepath = &pArchive->m_MemBlock[currMemBlockPos + offsetToFilename];
                    currNullTerminatorPos = (currMemBlockPos + offsetToFilename) + filenameLength;
                    pArchive->m_Entries.push_back(pFilepath);

                    currMemBlockPos += (offsetToFilename + filenameLength + extraFieldLen);
                }

                return pArchive;
            }
        }
        return nullptr;
    }

    ZipArchive::ZipArchive(long size)
    {
        m_MemBlock = new char[size];
    }

    ZipArchive::~ZipArchive()
    {
        delete[] m_MemBlock;
    }

    const std::vector<const char*>  &ZipArchive::GetEntries()
    {
        return m_Entries;
    }
}

TinyZip.h

#ifndef __TinyZip__
#define __TinyZip__

#include <vector>
#include <string>

namespace TinyZip
{
    class ZipArchive
    {
    public:
        ZipArchive(long memBlockSize);
        ~ZipArchive();

        static ZipArchive* GetArchive(const char *filepath);

        const std::vector<const char*>  &GetEntries();

    private:
        std::vector<const char*> m_Entries;
        char *m_MemBlock;
    };

}


#endif

用法:

 TinyZip::ZipArchive *pArchive = TinyZip::ZipArchive::GetArchive("Scripts_unencrypt.pak");
 if (pArchive != nullptr)
 {
     const std::vector<const char*> entries = pArchive->GetEntries();
     for (auto entry : entries)
     {
         //do stuff
     }
 }

Here is a solution I have just had to roll out incase anybody needs this. This involves grabbing the central directory.

In my case I did not want any of the compression features that are offered in any of the zip solutions. I just wanted to know about the contents. The following code will return a ZipArchive of a listing of every entry in the zip.

It also uses a minimum amount of file access and memory allocation.

TinyZip.cpp

#include "TinyZip.h"
#include <cstdio>

namespace TinyZip
{
#define VALID_ZIP_SIGNATURE 0x04034b50
#define CENTRAL_DIRECTORY_EOCD 0x06054b50 //signature
#define CENTRAL_DIRECTORY_ENTRY_SIGNATURE 0x02014b50
#define PTR_OFFS(type, mem, offs) *((type*)(mem + offs)) //SHOULD BE OK 

    typedef struct {
        unsigned int signature : 32;
        unsigned int number_of_disk : 16;
        unsigned int disk_where_cd_starts : 16;
        unsigned int number_of_cd_records : 16;
        unsigned int total_number_of_cd_records : 16;
        unsigned int size_of_cd : 32;
        unsigned int offset_of_start : 32;
        unsigned int comment_length : 16;
    } ZipEOCD;

    ZipArchive* ZipArchive::GetArchive(const char *filepath)
    {
        FILE *pFile = nullptr;
#ifdef WIN32
        errno_t err;
        if ((err = fopen_s(&pFile, filepath, "rb")) == 0)
#else
        if ((pFile = fopen(filepath, "rb")) == NULL)
#endif
        {
            int fileSignature = 0;
            //Seek to start and read zip header
            fread(&fileSignature, sizeof(int), 1, pFile);
            if (fileSignature != VALID_ZIP_SIGNATURE) return false;

            //Grab the file size
            long fileSize = 0;
            long currPos = 0;

            fseek(pFile, 0L, SEEK_END);
            fileSize = ftell(pFile);
            fseek(pFile, 0L, SEEK_SET);

            //Step back the size of the ZipEOCD 
            //If it doesn't have any comments, should get an instant signature match
            currPos = fileSize;
            int signature = 0;
            while (currPos > 0)
            {
                fseek(pFile, currPos, SEEK_SET);
                fread(&signature, sizeof(int), 1, pFile);
                if (signature == CENTRAL_DIRECTORY_EOCD)
                {
                    break;
                }
                currPos -= sizeof(char); //step back one byte
            }

            if (currPos != 0)
            {
                ZipEOCD zipOECD;
                fseek(pFile, currPos, SEEK_SET);
                fread(&zipOECD, sizeof(ZipEOCD), 1, pFile);

                long memBlockSize = fileSize - zipOECD.offset_of_start;

                //Allocate zip archive of size
                ZipArchive *pArchive = new ZipArchive(memBlockSize);

                //Read in the whole central directory (also includes the ZipEOCD...)
                fseek(pFile, zipOECD.offset_of_start, SEEK_SET);
                fread((void*)pArchive->m_MemBlock, memBlockSize - 10, 1, pFile);
                long currMemBlockPos = 0;
                long currNullTerminatorPos = -1;
                while (currMemBlockPos < memBlockSize)
                {
                    int sig = PTR_OFFS(int, pArchive->m_MemBlock, currMemBlockPos);
                    if (sig != CENTRAL_DIRECTORY_ENTRY_SIGNATURE)
                    {
                        if (sig == CENTRAL_DIRECTORY_EOCD) return pArchive;
                        return nullptr; //something went wrong
                    }

                    if (currNullTerminatorPos > 0)
                    {
                        pArchive->m_MemBlock[currNullTerminatorPos] = '\0';
                        currNullTerminatorPos = -1;
                    }

                    const long offsToFilenameLen = 28;
                    const long offsToFieldLen = 30;
                    const long offsetToFilename = 46;

                    int filenameLength = PTR_OFFS(int, pArchive->m_MemBlock, currMemBlockPos + offsToFilenameLen);
                    int extraFieldLen = PTR_OFFS(int, pArchive->m_MemBlock, currMemBlockPos + offsToFieldLen);
                    const char *pFilepath = &pArchive->m_MemBlock[currMemBlockPos + offsetToFilename];
                    currNullTerminatorPos = (currMemBlockPos + offsetToFilename) + filenameLength;
                    pArchive->m_Entries.push_back(pFilepath);

                    currMemBlockPos += (offsetToFilename + filenameLength + extraFieldLen);
                }

                return pArchive;
            }
        }
        return nullptr;
    }

    ZipArchive::ZipArchive(long size)
    {
        m_MemBlock = new char[size];
    }

    ZipArchive::~ZipArchive()
    {
        delete[] m_MemBlock;
    }

    const std::vector<const char*>  &ZipArchive::GetEntries()
    {
        return m_Entries;
    }
}

TinyZip.h

#ifndef __TinyZip__
#define __TinyZip__

#include <vector>
#include <string>

namespace TinyZip
{
    class ZipArchive
    {
    public:
        ZipArchive(long memBlockSize);
        ~ZipArchive();

        static ZipArchive* GetArchive(const char *filepath);

        const std::vector<const char*>  &GetEntries();

    private:
        std::vector<const char*> m_Entries;
        char *m_MemBlock;
    };

}


#endif

Usage:

 TinyZip::ZipArchive *pArchive = TinyZip::ZipArchive::GetArchive("Scripts_unencrypt.pak");
 if (pArchive != nullptr)
 {
     const std::vector<const char*> entries = pArchive->GetEntries();
     for (auto entry : entries)
     {
         //do stuff
     }
 }
追星践月 2024-10-21 21:50:15

如果有人仍然在解决这个问题 - 请查看我在 GitHub 上托管的存储库,其中包含可以回答您的问题的项目。

Zip 文件阅读器
基本上,它的作用是下载位于文件末尾的 .zip 文件的中央目录部分。
然后它会从字节中读出每个文件和文件夹名称及其路径并将其打印到控制台。

我已经对源代码中更复杂的步骤做了评论。

该程序只能运行到大约 4GB 的 .zip 文件。之后,您将必须对虚拟机大小进行一些更改,甚至可能更多。

享受 :)

In case someone out there is still struggling with this problem - have a look at the repository I hosted on GitHub containing my project that could answer your questions.

Zip file reader
Basically what it does is download the central directory part of the .zip file which resides in the end of the file.
Then it will read out every file and folder name with it's path from the bytes and print it out to console.

I have made comments about the more complicated steps in my source code.

The program can work only till about 4GB .zip files. After that you will have to do some changes to the VM size and maybe more.

Enjoy :)

破晓 2024-10-21 21:50:15

我最近遇到了一个类似的用例,并认为我会为后代分享我的解决方案,因为这篇文章帮助我走向了正确的方向。

使用维基百科此处上详细介绍的 Zip 文件中央目录偏移量,我们可以采用使用以下方法来解析中央目录并检索所包含文件的列表:

步骤:

  1. 通过扫描二进制格式的 zip 文件中的 EOCDR 签名来查找中央目录记录 (EOCDR) 的末尾 ( 0x06054b50),从文件末尾开始(即如果使用 ifstream,则使用 std::ios::ate 反向读取文件)
  2. 使用位于 EOCDR 中的偏移量(距 EOCDR 16 字节)将流读取器定位在中央目录的开头
  3. 使用偏移量(距 CD 46 字节) start)将流读取器定位在文件名处并跟踪其位置起始点
  4. 扫描直到找到另一个中央目录头(0x02014b50)或找到EOCDR,然后跟踪位置
  5. 将读取器重置为文件名的开头并读取直到结尾
  6. 将读取器定位到下一个标头,或者如果找到 EOCDR 则终止

这里的关键点是 EOCDR 由签名 (0x06054b50) 唯一标识仅发生一次。使用16字节偏移量,我们可以将自己定位到中央目录头(0x02014b50)的第一次出现。每条记录都将具有相同的 0x02014b50 标头签名,因此您只需循环出现标头签名,直到再次遇到 EOCDR 结束签名 (0x06054b50)。

摘要:

如果您想查看上述步骤的工作示例,您可以在 GitHub 上查看我的最小实现 (ZipReader) 此处。该实现可以像这样使用:

ZipReader zr;
    
if (zr.SetInput("blah.zip") == ZipReaderStatus::S_FAIL)
        std::cout << "set input error" << std::endl;

std::vector<std::string> entries;
if (zr.GetEntries(entries) == ZipReaderStatus::S_FAIL)
        std::cout << "get entries error" << std::endl;

for (auto entry : entries)
        std::cout << entry << std::endl;

I recently encountered a similar use-case and figured I would share my solution for posterity since this post helped send me in the right direction.

Using the Zip file central directory offsets detailed on Wikipedia here, we can take the following approach to parse the central directory and retrieve a list of the contained files:

STEPS:

  1. Find the end of the central directory record (EOCDR) by scanning the zip file in binary format for the EOCDR signature (0x06054b50), beginning at the end of the file (i.e. read the file in reverse using std::ios::ate if using a ifstream)
  2. Use the offset located in the EOCDR (16 bytes from the EOCDR) to position the stream reader at the beginning of the central directory
  3. Use the offset (46 bytes from the CD start) to position the stream reader at the file name and track its position start point
  4. Scan until either another central directory header is found (0x02014b50) or the EOCDR is found, and track the position
  5. Reset the reader to the start of the file name and read until the end
  6. Position the reader over the next header, or terminate if the EOCDR is found

The key point here is that the EOCDR is uniquely identified by a signature (0x06054b50) that occurs only one time. Using the 16 byte offset, we can position ourselves to the first occurrence of the central directory header (0x02014b50). Each record will have the same 0x02014b50 header signature, so you just need to loop through occurrences of the header signatures until you hit the EOCDR ending signature (0x06054b50) again.

SUMMARY:

If you want to see a working example of the above steps, you can check out my minimal implementation (ZipReader) on GitHub here. The implementation can be used like this:

ZipReader zr;
    
if (zr.SetInput("blah.zip") == ZipReaderStatus::S_FAIL)
        std::cout << "set input error" << std::endl;

std::vector<std::string> entries;
if (zr.GetEntries(entries) == ZipReaderStatus::S_FAIL)
        std::cout << "get entries error" << std::endl;

for (auto entry : entries)
        std::cout << entry << std::endl;
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文