Can you implement journaling with a single fsync per commit?
Let's say you're building a journaling/write-ahead-logging storage system. Can you simply implement this by (for each transaction) appending the data (with write(2)), appending a commit marker, and then fsync-ing?
The scenario to consider is if you do a large set of writes to this log then fsync it, and there's a failure during the fsync. Are the inode direct/indirect block pointers flushed only after all data blocks are flushed, or are there no guarantees that blocks are being flushed in order? If the latter, then during recovery, if you see a commit marker at the end of the file, you can't trust that the data between it and the previous commit marker is meaningful. Thus you have to rely on another mechanism (involving at least another fsync) to determine what extent of the log file is consistent (e.g., writing/fsyncing the data, then writing/fsyncing the commit marker).
If it makes a difference, mainly wondering about ext3/ext4 as the context.
Comments (2)
There's no guarantee on the order in which blocks are flushed to disk. These days even the drive itself can reorder blocks on their way to the platters.
If you want to enforce ordering, you need at least an fdatasync() between the writes that you want ordered. All a sync promises is that when it returns, everything written before the sync has hit storage.
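To make the ordering point concrete, here is a minimal sketch of the two-sync variant in Python. The marker format and the names `commit_ordered` and `COMMIT_MARKER` are made up for illustration, and `os.fdatasync` is swapped for `os.fsync` on platforms that lack it.

```python
import os
import tempfile

# os.fdatasync is missing on some platforms (e.g. macOS); fall back to fsync.
_sync = getattr(os, "fdatasync", os.fsync)

COMMIT_MARKER = b"COMMIT\n"  # hypothetical marker format, illustration only


def _write_all(fd: int, data: bytes) -> None:
    """os.write may write fewer bytes than asked for, so loop until done."""
    while data:
        data = data[os.write(fd, data):]


def commit_ordered(fd: int, payload: bytes) -> None:
    """Append the payload, sync, then append the commit marker and sync again."""
    _write_all(fd, payload)
    _sync(fd)  # returns only after the payload has been handed to storage
    _write_all(fd, COMMIT_MARKER)
    _sync(fd)  # returns only after the marker has been handed to storage


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "wal.log")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        commit_ordered(fd, b"txn-1 data\n")
    finally:
        os.close(fd)
```

Because the marker is not even written until the first sync has returned, a marker found during recovery implies the payload before it was already synced. The cost is two syncs per commit.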
Note that Linux's and macOS's fsync and fdatasync are incorrect by default. Windows is correct by default, but can emulate Linux for benchmarking purposes.
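For the macOS half of that claim, one reading is that a plain fsync() hands the data to the drive without asking the drive to flush its own cache; the fsync(2) man page there documents the `F_FULLFSYNC` fcntl for the stronger behavior. A small sketch, where the helper name `full_sync` is my own and the fallback to `os.fsync` is an assumption about what you would want on other platforms:

```python
import fcntl
import os
import tempfile


def full_sync(fd: int) -> None:
    """Request the strongest flush the platform exposes for this descriptor."""
    if hasattr(fcntl, "F_FULLFSYNC"):  # constant is only defined on macOS
        try:
            fcntl.fcntl(fd, fcntl.F_FULLFSYNC)
            return
        except OSError:
            pass  # not supported for this file; fall through to fsync
    os.fsync(fd)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "data.bin")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.write(fd, b"payload")
        full_sync(fd)
    finally:
        os.close(fd)
```

On Linux this helper is just fsync(), so whether the drive cache is flushed still depends on the kernel, filesystem and mount options in use.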
Also, fdatasync issues multiple disk writes if you append to the end of a file, since it needs to update the file inode with the new length. If you want to have one write per commit, your best bet is to pre-allocate log space, store a CRC of the log entries in the commit marker, and issue a single fdatasync() at commit. That way, no matter how much the OS / hardware reorder behind your back, you can find a prefix of the log that actually hit disk.
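A rough sketch of that scheme follows. The record layout (4-byte length, payload, 4-byte CRC32 acting as the commit marker), `LOG_SIZE`, and all function names are assumptions for illustration. The log is zero-filled up front rather than allocated with fallocate, and whether overwriting such blocks really avoids every metadata write depends on the filesystem.

```python
import os
import struct
import tempfile
import zlib

LOG_SIZE = 1 << 16  # hypothetical pre-allocated log size (64 KiB)
_sync = getattr(os, "fdatasync", os.fsync)  # macOS lacks os.fdatasync


def _pwrite_all(fd, data, offset):
    """os.pwrite may write fewer bytes than asked for, so loop until done."""
    while data:
        n = os.pwrite(fd, data, offset)
        data, offset = data[n:], offset + n


def create_log(path):
    """Create a zero-filled log so later commits never change the file length."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
    _pwrite_all(fd, b"\0" * LOG_SIZE, 0)
    os.fsync(fd)  # one-time full sync so the size and allocation are durable
    return fd


def commit(fd, offset, payload):
    """Write one record at offset with a single sync; return the next offset."""
    if not payload:
        raise ValueError("empty payload: length 0 marks the end of the log")
    header = struct.pack("<I", len(payload))
    crc = zlib.crc32(header + payload)
    record = header + payload + struct.pack("<I", crc)  # CRC is the marker
    if offset + len(record) > LOG_SIZE:
        raise ValueError("log full")
    _pwrite_all(fd, record, offset)
    _sync(fd)
    return offset + len(record)


def recover(fd):
    """Return (payloads, end_offset) for the longest prefix whose CRCs check."""
    data = os.pread(fd, LOG_SIZE, 0)
    payloads, offset = [], 0
    while offset + 8 <= len(data):
        (length,) = struct.unpack_from("<I", data, offset)
        end = offset + 4 + length
        if length == 0 or end + 4 > len(data):
            break  # zero-filled tail, or a length that cannot fit
        (stored,) = struct.unpack_from("<I", data, end)
        if zlib.crc32(data[offset:end]) != stored:
            break  # torn or reordered write: stop at the first bad record
        payloads.append(data[offset + 4:end])
        offset = end + 4
    return payloads, offset


if __name__ == "__main__":
    fd = create_log(os.path.join(tempfile.mkdtemp(), "wal.log"))
    nxt = commit(fd, 0, b"txn-1")
    nxt = commit(fd, nxt, b"txn-2")
    print(recover(fd))
    os.close(fd)
```

Recovery never trusts a record whose CRC does not match, so it does not matter in which order the blocks of a commit reached the disk. Reusing a log that still contains old, valid records would need something extra (for example a sequence number inside the checksummed header), which this sketch leaves out.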
If you want to use the log for durable commits or write-ahead logging, things get harder, since you need to make sure that fsync actually works. Under Linux, you'll want to disable the disk write cache with hdparm, or mount the partition with barriers enabled. [Edit: I stand corrected, barriers don't seem to give the correct semantics. SATA and SCSI introduce a number of primitives, such as write barriers and native command queuing, that make it possible for operating systems to export primitives that enable write-ahead logging. From what I can tell from manpages and online, Linux only exposes these to filesystem developers, not to userspace.]
Paradoxically, disabling the disk write cache sometimes leads to better performance, since you get more control over write scheduling in user space; if the disk queues up a bunch of synchronous write requests, you end up exposing strange latency spikes to the application. Disabling write cache prevents this from happening.
Finally, real systems use group commit, and perform fewer than one synchronous write per commit under concurrent workloads.
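Group commit can be sketched with a leader/follower pattern: whichever committer finds no sync in progress writes out everything queued so far and syncs once, while the others wait for that sync to cover their record. The class and attribute names below are hypothetical, and error handling is left out (a real implementation has to fail the waiters if the write or sync raises).

```python
import os
import tempfile
import threading


class GroupCommitLog:
    """Batches records from concurrent committers behind one fsync per batch."""

    def __init__(self, path):
        self._fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self._cond = threading.Condition()
        self._pending = []      # records queued for the next batch
        self._next_seq = 0      # sequence number of the most recently queued record
        self._durable_seq = 0   # records with seq <= this value have been synced
        self._syncing = False   # True while some thread is flushing a batch
        self.sync_count = 0     # number of fsync calls issued so far

    def commit(self, record: bytes) -> None:
        """Queue the record and return once a sync has covered it."""
        with self._cond:
            self._pending.append(record)
            self._next_seq += 1
            my_seq = self._next_seq
        while True:
            with self._cond:
                while self._syncing and self._durable_seq < my_seq:
                    self._cond.wait()  # another thread is flushing; wait for it
                if self._durable_seq >= my_seq:
                    return  # an earlier batch already covered this record
                # Become the leader: take everything queued so far.
                batch, self._pending = self._pending, []
                batch_end = self._next_seq
                self._syncing = True
            data = b"".join(batch)
            while data:  # os.write may be partial, so loop until done
                data = data[os.write(self._fd, data):]
            os.fsync(self._fd)  # one sync for the whole batch
            with self._cond:
                self._durable_seq = batch_end
                self._syncing = False
                self.sync_count += 1
                self._cond.notify_all()

    def close(self) -> None:
        os.close(self._fd)


if __name__ == "__main__":
    log = GroupCommitLog(os.path.join(tempfile.mkdtemp(), "group.log"))
    threads = [
        threading.Thread(target=log.commit, args=(b"txn-%d\n" % i,))
        for i in range(16)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("commits: 16, syncs:", log.sync_count)
    log.close()
```

How far the sync count drops below the commit count depends on how many committers pile up while a sync is in flight, so on a fast device or with little concurrency the two numbers can be equal.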