可中断的就地排序算法

发布于 2024-09-18 18:33:15 字数 741 浏览 6 评论 0原文

我需要用 C 语言编写一个排序程序,如果可以对文件进行就地排序以节省磁盘空间,那就太好了。数据很有价值,因此我需要确保如果进程中断(ctrl-c),文件不会损坏。我可以保证机器上的电源线不会被拉断。

额外详细信息:文件约为 40GB,记录为 128 位,机器为 64 位,操作系统为 POSIX

有关完成此操作的任何提示或一般说明吗?

谢谢!

澄清一下:我预计用户会想要 ctrl-c 该过程。在这种情况下,我想优雅地退出并确保数据安全。所以这个问题是关于处理中断并选择一种可以在需要时快速结束的排序算法。

后续(2年后):为了后代,我安装了 SIGINT 处理程序,并且效果很好。这并不能保护我免受断电的影响,但这是我可以应对的风险。代码位于 https://code.google.com/p/ pawnsbfs/source/browse/trunk/hsort.chttps://code.google.com/p/pawnsbfs/source/browse/trunk/qsort.c

I need to write a sorting program in C and it would be nice if the file could be sorted in place to save disk space. The data is valuable, so I need to ensure that if the process is interrupted (ctrl-c) the file is not corrupted. I can guarantee the power cord on the machine will not be yanked.

Extra details: file is ~40GB, records are 128-bit, machine is 64-bit, OS is POSIX

Any hints on accomplishing this, or notes in general?

Thanks!

To clarify: I expect the user will want to ctrl-c the process. In this case, I want to exit gracefully and ensure that the data is safe. So this question is about handling interrupts and choosing a sort algorithm that can wrap up quickly if requested.

Following up (2 years later): Just for posterity, I have installed the SIGINT handler and it worked great. This does not protect me against power failure, but that is a risk I can handle. Code at https://code.google.com/p/pawnsbfs/source/browse/trunk/hsort.c and https://code.google.com/p/pawnsbfs/source/browse/trunk/qsort.c

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(6

躲猫猫 2024-09-25 18:33:16

Jerry 是对的,如果您担心的只是 Ctrl-C,您可以一次忽略 SIGINT 一段时间。如果您想防止进程死亡,您需要某种最小的日志记录。为了交换两个元素:

1) 在文件末尾或单独文件中的控制结构中添加一条记录,指示要交换文件中的哪两个元素 A 和 B。

2) 将 A 复制到暂存空间,记录您已完成的操作,刷新。

3) 将 B 复制到 A 上,然后在暂存空间中记录已完成的操作,刷新

4) 从暂存空间复制到 B 上。

5) 删除记录。

对于所有实际目的来说,这是 O(1) 的额外空间,因此在大多数定义下仍然算作就地。理论上,如果 n 可以任意大,则记录索引的时间复杂度为 O(log n):实际上,它是一个非常小的 log n,并且合理的硬件/运行时间将其限制在 64 以上。

在所有情况下,当我说“刷新”时,我意味着提交更改“足够远”。有时,您的基本刷新操作仅刷新进程内的缓冲区,但它实际上并不同步物理介质,因为它不会通过操作系统/设备驱动程序/硬件级别一直刷新缓冲区。当您担心的只是进程死亡时,这就足够了,但如果您担心介质突然卸载,那么您就必须冲过驱动程序。如果您担心电源故障,则必须同步硬件,但事实并非如此。有了 UPS,或者如果您认为停电很少发生并且不介意丢失数据,那也没关系。

启动时,检查暂存空间中是否有任何“正在进行的交换”记录。如果找到一个,请计算出您已经走了多远,并从那里完成交换以使数据恢复到良好状态。然后重新开始排序。

显然,这里存在性能问题,因为您的记录写入量是以前的两倍,并且刷新/同步可能会非常昂贵。在实践中,您的就地排序可能有一些复合的移动内容操作,涉及许多交换,但您可以对其进行优化以避免每个元素都触及暂存空间。您只需确保在覆盖任何数据之前,您在某处拥有一份安全的副本,并记录该副本应存放的位置,以便使您的文件恢复到仅包含每个元素的一个副本的状态。

Jerry 的观点也是正确的,即对于大多数实际用途来说,真正的就地排序太困难且太慢。如果您可以节省原始文件大小的一些线性部分作为暂存空间,那么您将可以通过合并排序获得更好的时间。

根据您的澄清,即使使用就地排序,您也不需要任何刷新操作。您需要内存中以相同方式工作的暂存空间,并且您的 SIGINT 处理程序可以访问该空间,以便在退出之前确保数据安全,而不是在启动后恢复数据。异常退出,并且您需要以信号安全的方式访问该内存(从技术上讲,这意味着使用 sig_atomic_t 来标记已进行的更改)。即便如此,使用合并排序可能比真正的就地排序更好。

Jerry's right, if it's just Ctrl-C you're worried about, you can ignore SIGINT for periods at a time. If you want to be proof against process death in general, you need some sort of minimal journalling. In order to swap two elements:

1) Add a record to a control structure at the end of the file or in a separate file, indicating which two elements of the file you are going to swap, A and B.

2) Copy A to the scratch space, record that you've done so, flush.

3) Copy B over A, then record in the scratch space that you have done so, flush

4) Copy from the scratch space over B.

5) Remove the record.

This is O(1) extra space for all practical purposes, so still counts as in-place under most definitions. In theory recording an index is O(log n) if n can be arbitrarily large: in reality it's a very small log n, and reasonable hardware / running time bounds it above at 64.

In all cases when I say "flush", I mean commit the changes "far enough". Sometimes your basic flush operation only flushes buffers within the process, but it doesn't actually sync the physical medium, because it doesn't flush buffers all the way through the OS/device driver/hardware levels. That's sufficient when all you're worried about is process death, but if you're worried about abrupt media dismounts then you'd have to flush past the driver. If you were worried about power failure, you'd have to sync the hardware, but you're not. With a UPS or if you think power cuts are so rare you don't mind losing data, that's fine.

On startup, check the scratch space for any "swap-in-progress" records. If you find one, work out how far you got and complete the swap from there to get the data back into a sound state. Then start your sort over again.

Obviously there's a performance issue here, since you're doing twice as much writing of records as before, and flushes/syncs may be astonishingly expensive. In practice your in-place sort might have some compound moving-stuff operations, involving many swaps, but which you can optimise to avoid every element hitting the scratch space. You just have to make sure that before you overwrite any data, you have a copy of it safe somewhere and a record of where that copy should go in order to get your file back to a state where it contains exactly one copy of each element.

Jerry's also right that true in-place sorting is too difficult and slow for most practical purposes. If you can spare some linear fraction of the original file size as scratch space, you'll have a much better time of it with a merge sort.

Based on your clarification, you wouldn't need any flush operations even with an in-place sort. You need scratch space in memory that works the same way, and that your SIGINT handler can access in order to get the data safe before exiting, rather than restoring on startup after an abnormal exit, and you need to access that memory in a signal-safe way (which technically means using a sig_atomic_t to flag which changes have been made). Even so, you're probably better off with a mergesort than a true in-place sort.

讽刺将军 2024-09-25 18:33:16

SIGINT 安装一个处理程序,该处理程序仅设置“进程应立即退出”标志。

在排序中,每次交换两条记录后(或每 N 次交换后)检查标志。如果设置了标志,则退出。

Install a handler for SIGINT that just sets a "process should exit soon" flag.

In your sort, check the flag after every swap of two records (or after every N swaps). If the flag is set, bail out.

他夏了夏天 2024-09-25 18:33:16

防止 ctrl-c 的部分非常简单:signal(SIGINT, SIG_IGN);

就排序本身而言,合并排序通常适用于外部排序。基本思想是将尽可能多的记录读入内存,对它们进行排序,然后将它们写回磁盘。到目前为止,处理此问题的最简单方法是将每次运行写入磁盘上的单独文件。然后将它们合并在一起——将每次运行的第一条记录读入内存,并将其中最小的记录写入原始文件;从提供该记录的运行中读取另一条记录,然后重复直到完成。最后一个阶段是您唯一修改原始文件的时间,因此这是您真正需要确保不发生中断等的唯一时间。

另一种可能性是使用选择排序。缺点是排序本身相当慢。好处是,编写它可以很容易地适应几乎所有情况,而无需使用太多额外空间。总体思路非常简单:找到文件中最小的记录,并将其交换到第一个位置。然后找到剩下的最小记录,并将其交换到第二个位置,依此类推,直到完成。这样做的好处是日志记录很简单:在进行交换之前,记录要交换的两条记录的值。由于排序是从第一条记录到最后一条记录,因此您唯一需要跟踪的是在任何给定时间有多少记录已经排序。

The part for protecting against ctrl-c is pretty easy: signal(SIGINT, SIG_IGN);.

As far as the sorting itself goes, a merge sort generally works well for external sorting. The basic idea is to read as many records into memory as you can, sort them, then write them back out to disk. By far the easiest way to handle this is to write each run to a separate file on disk. Then you merge those back together -- read the first record from each run into memory, and write the smallest of those out to the original file; read another record from the run that supplied that record, and repeat until done. The last phase is the only time you're modifying the original file, so it's the only time you really need to assure against interruptions and such.

Another possibility is to use a selection sort. The bad point is that the sorting itself is quite slow. The good point is that it's pretty easy to write it to survive almost anything, without using much extra space. The general idea is pretty simple: find the smallest record in the file, and swap that into the first spot. Then find the smallest record of what's left, and swap that into the second spot, and so on until done. The good point of this is that journaling is trivial: before you do a swap, you record the values of the two records you're going to swap. Since the sort runs from the first record to the last, the only other thing you need to track is how many records are already sorted at any given time.

Saygoodbye 2024-09-25 18:33:16

使用堆排序,并防止每次交换操作期间出现中断(例如块信号)。

Use heap sort, and prevent interruptions (e.g. block signals) during each swap operation.

携余温的黄昏 2024-09-25 18:33:16

备份您计划更改的任何内容。添加一个标记,标志着排序成功。如果一切正常则保留结果,否则恢复备份。

Backup whatever you plan to change. The put a flag that marks a successful sort. If everything is OK then keep the result, otherwise restore backup.

久隐师 2024-09-25 18:33:16

假设是 64 位操作系统(您说它是 64 位机器,但仍然可以运行 32 位操作系统),您可以使用 mmap 将文件映射到数组,然后在数组上使用 qsort。

添加 SIGINT 的处理程序以调用 msync 和 munmap,以允许应用程序响应 Ctrl-C 而不会丢失数据。

Assuming a 64-bit OS (you said it is a 64bit machine but could still be running 32bit OS), you could use mmap to map the file to an array then use qsort on the array.

Add a handler for SIGINT to call msync and munmap to allow the app to respond to Ctrl-C without losing data.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文