What is a good algorithm for compacting records in a file made of blocks?

Published 2024-07-05 21:41:00


Suppose you have a large file made up of a bunch of fixed-size blocks. Each of these blocks contains some number of variable-sized records. Each record must fit completely within a single block, so such records are by definition never larger than a full block. Over time, records are added to and deleted from these blocks as records come and go from this "database".

At some point, especially after many records have been added to the database and several removed, many of the blocks may end up only partially filled.

What is a good algorithm to shuffle the records around in this database to compact out unnecessary blocks at the end of the file by better filling up the partially filled blocks?

Requirements of the algorithm:

  • The compaction must happen in place, within the original file, without temporarily extending the file by more than a few blocks beyond its starting size
  • The algorithm should not unnecessarily disturb blocks that are already mainly full
  • Entire blocks must be read from or written to the file at one time, and one should assume the write operation is relatively expensive
  • If records are moved from one block to another they must be added at their new location before being removed from their starting position so that in case the operation is interrupted no records are lost as a result of the "failed" compaction. (Assume that this temporary duplication of such records can be detected at recovery).
  • The memory that can be used for this operation can only be on the order of perhaps several blocks which is a very small percentage of the overall file size
  • Assume that records are on the order of 10 bytes to 1K bytes with an average size of maybe 100 bytes. The fixed-size blocks are on the order of 4K or 8K, and the file is on the order of thousands of blocks.
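To make these numbers concrete, here is a minimal Python sketch of the layout described above. `BLOCK_SIZE`, `free_space`, and `blocks_saved` are hypothetical names, and a block is modeled simply as a list of record sizes in bytes; the real on-disk format would of course carry record headers and a block directory.

```python
BLOCK_SIZE = 4096  # assumed fixed block size (the question says ~4K or 8K)

def free_space(block):
    """Bytes left in a block, given the sizes of the records it holds."""
    return BLOCK_SIZE - sum(block)

def blocks_saved(blocks):
    """Lower bound on how many blocks compaction could reclaim:
    total free space divided by the block size (ignores packing loss)."""
    total_free = sum(free_space(b) for b in blocks)
    return total_free // BLOCK_SIZE

blocks = [[1000, 2000], [500], [3000, 900]]
print(blocks_saved(blocks))  # → 1 (4888 free bytes, one whole block's worth)
```

Because records average ~100 bytes against a ~4K block, the per-block packing loss is small, so this lower bound is usually close to what a decent packer actually achieves.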

Comments (4)

撩动你心 2024-07-12 21:41:00


If there is no ordering to these records, I'd simply fill the blocks from the front with records extracted from the last block(s). This will minimize movement of data, is fairly simple, and should do a decent job of packing data tightly.

E.g.:

// records should be sorted by size in memory (probably in a balanced BST)
records = read last N blocks on disk;

foreach (block in blocks) // read from disk into memory
{
    if (block.hasBeenReadFrom())
    {
        // we read from this into records already
        // all remaining records are already in memory

        writeAllToNewBlocks(records);

        // this will leave some empty blocks on the disk that can either
        // be eliminated programmatically or left alone and filled during
        // normal operation

        foreach (record in records)
        {
            record.eraseFromOriginalLocation();
        }

        break;
    }

    moveRecords = new Array; // list of records we've moved into this block

    while (!block.full())
    {

        size = block.availableSpace();
        record = records.extractBestFit(size);
        if (record == null)
        {
            break;
        }

        moveRecords.add(record);
        block.add(record);

        if (records.gettingLow())
        {
            records.readMoreFromDisk();
        }
    }

    if(moveRecords.size() > 0)
    {
        block.writeBackToDisk();
        foreach (record in moveRecords)
        {
            record.eraseFromOriginalLocation();
        }
    }
}

Update: I neglected to maintain the rule that no record may exist only in memory. I've updated the pseudocode to fix this. Also fixed a glitch in my loop condition.
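The pseudocode above can also be condensed into a runnable in-memory sketch. Blocks are modeled as lists of record sizes, `BLOCK_SIZE` and `compact` are assumed names, and the disk I/O plus the write-before-erase recovery step from the question's requirements are elided here:

```python
BLOCK_SIZE = 4096  # assumed block capacity in bytes

def compact(blocks):
    """Fill front blocks with best-fit records drawn from the back.

    In the real algorithm, a moved record would be written into its
    destination block on disk before being erased from its source block.
    """
    front, back = 0, len(blocks) - 1
    while front < back:
        free = BLOCK_SIZE - sum(blocks[front])
        # best fit: the largest record from the last block that still fits
        candidates = [r for r in blocks[back] if r <= free]
        if not candidates:
            front += 1              # this block is as full as it gets
            continue
        r = max(candidates)
        blocks[front].append(r)     # "write" to the new location first,
        blocks[back].remove(r)      # then erase from the old one
        if not blocks[back]:
            back -= 1               # source block emptied; drop it
    return [b for b in blocks if b]

print(compact([[1000], [500], [3000], [200, 100]]))
# → [[1000, 200, 100], [500, 3000]]
```

Note the loop only ever touches the current front block and the current back block, which matches the "several blocks of memory" budget; blocks that are already nearly full are skipped as soon as no remaining record fits them.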

停滞 2024-07-12 21:41:00


Here's an algorithm you might be able to leverage, although your records within fixed-size blocks might require a little more work.

Heap Defragmentation in Bounded Time

若言繁花未落 2024-07-12 21:41:00


This sounds like a variation of the bin packing problem, but where you already have an inferior allocation that you want to improve. So I suggest looking at variations of the approaches which are successful for the bin packing problem.

First of all, you probably want to parameterize your problem by defining what you consider "full enough" (where a block is full enough that you don't want to touch it), and what is "too empty" (where a block has so much empty space that it has to have more records added to it). Then, you can classify all the blocks as full enough, too empty, or partially full (those that are neither full enough nor too empty). You then redefine the problem as how to eliminate all the too empty blocks by creating as many full enough blocks as possible while minimising the number of partially full blocks.

You'll also want to work out what's more important: getting the records into the fewest blocks possible, or packing them adequately with the fewest blocks read and written.

My approach would be to make an initial pass over all the blocks, classifying each into one of the three classes defined above. For each block, you also want to keep track of the free space available in it, and for the too empty blocks you'll want a list of all the records and their sizes. Then, starting with the largest records in the too empty blocks, move them into the partially full blocks. If you want to minimise reads and writes, move them into whichever blocks you currently have in memory. If you want to minimise wasted space, find the block with the least empty space that will still hold the record, reading the block in if necessary. If no block will hold the record, create a new block. If a block in memory reaches the "full enough" threshold, write it out. Repeat until all the records from the too empty blocks have been placed.

I've skipped over more than a few details, but this should give you some idea. Note that the bin packing problem is NP-hard, which means that for practical applications, you'll want to decide what's most important to you, so you can pick an approach that will give you an approximately best solution in reasonable time.
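To illustrate the classification-and-placement idea, here is an in-memory Python sketch. The thresholds `FULL_ENOUGH` and `TOO_EMPTY`, and the names `classify`/`repack`, are assumptions to be tuned; blocks are lists of record sizes, and disk I/O and the write-before-erase safety rule are left out.

```python
BLOCK_SIZE = 4096
FULL_ENOUGH = 0.9   # assumed: >= 90% used means "don't touch it"
TOO_EMPTY = 0.5     # assumed: < 50% used means "drain its records elsewhere"

def classify(blocks):
    """Partition block indices into full-enough / partial / too-empty."""
    full, partial, too_empty = [], [], []
    for i, b in enumerate(blocks):
        fill = sum(b) / BLOCK_SIZE
        if fill >= FULL_ENOUGH:
            full.append(i)
        elif fill < TOO_EMPTY:
            too_empty.append(i)
        else:
            partial.append(i)
    return full, partial, too_empty

def repack(blocks):
    """Drain too-empty blocks, placing the largest records first into the
    partial block with the least free space that still fits them."""
    full, partial, too_empty = classify(blocks)
    loose = sorted((r for i in too_empty for r in blocks[i]), reverse=True)
    for i in too_empty:
        blocks[i] = []
    for r in loose:
        fits = [i for i in partial if BLOCK_SIZE - sum(blocks[i]) >= r]
        if fits:
            best = min(fits, key=lambda i: BLOCK_SIZE - sum(blocks[i]))
            blocks[best].append(r)
        else:
            blocks.append([r])              # open a fresh block
            partial.append(len(blocks) - 1)
    return [b for b in blocks if b]

print(repack([[3800], [1000], [500]]))  # → [[3800], [1000, 500]]
```

Placing the largest records first is the usual first-fit-decreasing heuristic for bin packing; it avoids stranding big records after the partial blocks have been filled with small ones.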

明媚殇 2024-07-12 21:41:00


A modification of an on-line (to defragment in one pass) bounded space (the memory requirements) bin packing algorithm could probably work here.

See "Bin Packing Approximation Algorithms: Combinatorial Analysis" by Coffman et al.
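As a rough illustration of the bounded-space idea (not any specific algorithm from that survey), here is a sketch that keeps at most `K` open blocks in memory at once; a block that cannot accept the incoming record may be closed (written out) and is never reopened. `BLOCK_SIZE`, `K`, and `bounded_space_pack` are assumed names, and the eviction rule (close the fullest open block) is one of several reasonable choices.

```python
BLOCK_SIZE = 4096
K = 4  # assumed memory budget: at most K blocks held open at once

def bounded_space_pack(records):
    """One-pass first-fit packing with at most K open bins (blocks)."""
    open_bins, closed = [], []
    for r in records:
        for b in open_bins:
            if sum(b) + r <= BLOCK_SIZE:
                b.append(r)         # first open bin that fits
                break
        else:
            if len(open_bins) == K:             # must evict to open a new bin
                fullest = max(open_bins, key=sum)
                closed.append(fullest)          # "write it out"
                open_bins.remove(fullest)
            open_bins.append([r])
    return closed + open_bins

print(bounded_space_pack([3000, 3000, 1000, 1000, 96, 96]))
# → [[3000, 1000, 96], [3000, 1000, 96]]
```

The appeal for this problem is that the memory bound is explicit (`K` blocks), matching the question's requirement, at the cost of a somewhat worse packing ratio than unbounded-space algorithms.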
