The last mapped page
POSIX says "The system always zero-fills any partial page at the end of an object. Further, the system never writes out any modified portions of the last page of an object that are beyond its end.", and the Linux and FreeBSD man pages use similar wording.
This suggests that although it is not strictly legitimate to read the last trailing bytes (as they are outside the mapped range), it is still well-defined and designed in such a way that it may happen without crashing. Even writing to that area is kind of well-defined.
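For what it's worth, a minimal POSIX probe along these lines makes the behaviour observable on a given system (the file name tail_test.bin is just a throw-away placeholder; this is a sketch, not proof of anything):

```c
/* Create a file whose size is not a page multiple, map it, and look at the
 * bytes between the end of the file and the end of the last page. Reading
 * them is exactly the "outside the mapped range, but inside the page" case
 * discussed above. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const char path[] = "tail_test.bin";            /* throw-away test file */
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) { perror("open"); return 1; }

    unsigned char buf[100];                         /* 100 bytes: not a page multiple */
    memset(buf, 0xFF, sizeof buf);
    if (write(fd, buf, sizeof buf) != (ssize_t)sizeof buf) { perror("write"); return 1; }

    long page = sysconf(_SC_PAGESIZE);
    unsigned char *map = mmap(NULL, sizeof buf, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    /* Bytes past the file's end, but still inside the page backing the mapping. */
    int all_zero = 1;
    for (long i = (long)sizeof buf; i < page; i++)
        if (map[i] != 0) all_zero = 0;

    printf("trailing bytes up to the page boundary are %s\n",
           all_zero ? "all zero" : "NOT all zero");

    munmap(map, sizeof buf);
    close(fd);
    unlink(path);
    return 0;
}
```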
The Windows documentation on the other hand does not say anything about trailing bytes in a less-than-blocksize range, and indeed warns that creating a mapping larger than the file will increase the file size and will not necessarily zero the data.
I'm inclined to believe that this is either wrong information or historic (maybe dating back to Win95?). SetFileValidData requires non-standard user rights because of the security concern that this might make data from a previously deleted file visible. If the Windows kernel developers allowed anyone to trivially bypass this by mapping any random file, they would have to be quite stupid.
My observation on Windows XP is that any new pages are apparently drawn from the zero pool, and for empty page writeback, either the file is silently made sparse, or the writeback is done in a very, very intelligent way (no noticeable delay at any time, even in the gigabyte range).
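A Win32 probe in the same spirit might look like the untested sketch below (tail_test.bin is again a placeholder for an existing file whose size is not a page multiple). As noted, the documentation does not promise the bytes are zero, so this only reports what a particular system happens to do:

```c
/* Map a file with a mapping of exactly the file size and inspect the bytes
 * between the file's end and the end of the last page of the view. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    HANDLE f = CreateFileA("tail_test.bin", GENERIC_READ, FILE_SHARE_READ,
                           NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (f == INVALID_HANDLE_VALUE) return 1;

    LARGE_INTEGER size;
    if (!GetFileSizeEx(f, &size)) return 1;

    /* Maximum size 0 means "size of the file"; the view is still page-granular. */
    HANDLE m = CreateFileMappingA(f, NULL, PAGE_READONLY, 0, 0, NULL);
    if (!m) return 1;

    const unsigned char *p = MapViewOfFile(m, FILE_MAP_READ, 0, 0, 0);
    if (!p) return 1;

    SYSTEM_INFO si;
    GetSystemInfo(&si);

    SIZE_T len  = (SIZE_T)size.QuadPart;
    SIZE_T page = si.dwPageSize;
    SIZE_T end  = (len + page - 1) / page * page;   /* end of the last page */

    int all_zero = 1;
    for (SIZE_T i = len; i < end; i++)
        if (p[i] != 0) all_zero = 0;

    printf("trailing bytes up to the page boundary are %s\n",
           all_zero ? "all zero" : "NOT all zero");

    UnmapViewOfFile(p);
    CloseHandle(m);
    CloseHandle(f);
    return 0;
}
```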
So what is the question about?
I need to calculate the hash values of (possibly thousands of) files to detect a subset of files that was modified. One can assume SHA-256 as the algorithm, though the actual algorithm does not really matter.
Which as such is of course no big challenge, but like every piece of software, it should run in no time and use no memory, and so on. The usual realistic expectations, you get it :-)
The normal way to calculate such a hash is to check whether the message's size is a multiple of the hash function's block size (say, 64 bytes) and zero-fill the last, incomplete block if it is not. Additionally, the hash may have alignment requirements.
This normally means that you must either make a full copy of the message, or write some special code that hashes all but one block plus a zero-padded copy of the last block. Or something similar. The hash algorithm often silently does that kind of thing on its own behalf, too. In any case it involves moving around a lot of data and more complexity than one would hope for.
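A sketch of that "special code" path, written against a hypothetical block-oriented core (hash_init/hash_blocks/hash_result are not a real library API; the dummy bodies below exist only so the sketch compiles):

```c
/* Feed all complete 64-byte blocks straight from the message, then hash a
 * zero-padded copy of the final partial block. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 64

typedef struct { uint64_t acc; } hash_state;         /* placeholder state */

static void hash_init(hash_state *s) { s->acc = 0; }

static void hash_blocks(hash_state *s, const uint8_t *p, size_t nblocks)
{
    /* Dummy stand-in: a real core would run its compression function here. */
    for (size_t i = 0; i < nblocks * BLOCK_SIZE; i++)
        s->acc = s->acc * 31 + p[i];
}

static uint64_t hash_result(const hash_state *s) { return s->acc; }

uint64_t hash_message(const uint8_t *msg, size_t len)
{
    hash_state st;
    hash_init(&st);

    size_t full = len / BLOCK_SIZE;                  /* complete blocks */
    size_t rest = len % BLOCK_SIZE;                  /* bytes in the trailing block */

    hash_blocks(&st, msg, full);                     /* bulk of the message, in place */

    if (rest != 0) {
        uint8_t last[BLOCK_SIZE] = {0};              /* zero-padded copy of the tail */
        memcpy(last, msg + full * BLOCK_SIZE, rest);
        hash_blocks(&st, last, 1);
    }
    return hash_result(&st);
}
```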
Now there is the temptation of directly hashing off a memory-mapped file and relying on the fact that file mapping necessarily depends on memory pages. Thus, both the start address and the physically mapped length are more or less guaranteed to be multiples of 4kB (64kB on some systems). Which of course means they are automatically also multiples of 64, 128, or any other block size that a hash might have.
And for security reasons, actually no OS can afford to give you a page containing stale data.
Which means you could just naively hash over the entire file without worrying about alignment, padding or anything, and avoid copying data. It might read a few bytes past the end of the mapped range, but it will necessarily still be within the same page.
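A sketch of that shortcut, POSIX flavour (hash_blocks is again a hypothetical block-oriented core, assumed to be provided elsewhere; error handling kept minimal):

```c
/* Map the file, round the length up to the hash block size, and hash straight
 * over the mapping, deliberately reading into the tail of the last page. */
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define BLOCK_SIZE 64

/* Hypothetical block-oriented hash core, not a real library function. */
void hash_blocks(void *state, const uint8_t *blocks, size_t nblocks);

int hash_file(const char *path, void *state)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {     /* mmap of length 0 fails */
        close(fd);
        return -1;
    }

    uint8_t *map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (map == MAP_FAILED) return -1;

    /* Round up to a whole number of hash blocks. The extra bytes lie past the
     * file's end but inside the last page, which (per the wording quoted
     * above) is zero-filled. */
    size_t padded = ((size_t)st.st_size + BLOCK_SIZE - 1) & ~(size_t)(BLOCK_SIZE - 1);

    hash_blocks(state, map, padded / BLOCK_SIZE);

    munmap(map, (size_t)st.st_size);
    return 0;
}
```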
I am of course aware that this is technically illegal. Reading the last bytes outside the mapped range is somewhat comparable to saying that malloc(5) always returns an 8-byte block anyway, so it's safe to use the extra 3 bytes.
Though, apart from that obvious thing, is my assumption that this will "just work" reasonable, or is there some serious problem that I fail to see on any major platform?
I'm not really all too much interested in theoretic or historic operating systems, but I'd like to stay somewhat portable. That is, I would like to be sure it works reliably on anything you are likely to encounter on a desktop computer or a "typical hosting server" (so, mostly Windows, Linux, BSD, OSX).
If there exists an operating system from 1985 which marks the last page non-readable and enforces strict byte-ranges inside its fault handler, I'm ok with that. You can't (and shouldn't) make everyone happy.
Comments (1)
Not really. This way you couldn't find out the length of the last block (was the zero there, or does it come from the padding?). Padding works a bit differently: in one scheme you always append a single 1 bit and then 0s until the end of the block. In case your data ends on a block boundary, it means that another block is needed. This extra block may fall in an extra page. So I don't think it could work the way you described.
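To make that concrete with SHA-256-like parameters (64-byte blocks, a marker byte for the 1 bit, an 8-byte length field), here is a small illustration of when the padding spills into an extra block:

```c
/* When the message fills its last block up to (or close to) the boundary,
 * the padding needs one more block - which may land on the next page. */
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE   64
#define LENGTH_FIELD 8    /* message length, appended at the very end */

static uint64_t padded_len(uint64_t msg_len)
{
    /* marker byte + zero fill + length field, rounded up to a full block */
    uint64_t with_overhead = msg_len + 1 + LENGTH_FIELD;
    return (with_overhead + BLOCK_SIZE - 1) / BLOCK_SIZE * BLOCK_SIZE;
}

int main(void)
{
    /* 4080 bytes: 48 bytes in the last block, marker and length still fit.
     * 4096 bytes: the data ends exactly on a block boundary, so a whole
     * extra block is needed just for the padding.                        */
    printf("msg 4080 -> padded %llu\n", (unsigned long long)padded_len(4080)); /* 4096 */
    printf("msg 4096 -> padded %llu\n", (unsigned long long)padded_len(4096)); /* 4160 */
    return 0;
}
```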
I think this should work on Intel/AMD as there's nothing anybody could do against it. The i386+ CPUs have segments and pages. Segments can end on any byte boundary, but AFAIK no current OS uses them. So as long as you stay in your page, it's all yours.
So I think it could work like this:
```
1000000000000000
```
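One possible reading of that suggestion (my own interpretation under stated assumptions, not necessarily what the example above, which looks truncated here, was meant to show): hash the block-aligned bulk straight out of the mapping, and build the trailing bytes plus the 1-bit marker, zero fill and length field in a small stack buffer, so nothing depends on what lies past the file's end. hash_blocks is again a hypothetical block-oriented core:

```c
/* Bulk from the mapping, final block(s) padded explicitly in a local buffer. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 64

void hash_blocks(void *state, const uint8_t *blocks, size_t nblocks); /* assumed */

void hash_mapped(void *state, const uint8_t *map, uint64_t file_len)
{
    size_t full = (size_t)(file_len / BLOCK_SIZE);
    size_t rest = (size_t)(file_len % BLOCK_SIZE);

    hash_blocks(state, map, full);                   /* bulk, straight from the mapping */

    /* Final block(s): tail bytes, 0x80 marker (the single 1 bit), zeros, and
     * a 64-bit big-endian bit length. Two blocks are needed when fewer than
     * 9 bytes of room remain in the last block. */
    uint8_t tail[2 * BLOCK_SIZE] = {0};
    memcpy(tail, map + full * BLOCK_SIZE, rest);
    tail[rest] = 0x80;

    size_t nblk = (rest + 1 + 8 > BLOCK_SIZE) ? 2 : 1;
    uint64_t bits = file_len * 8;
    for (int i = 0; i < 8; i++)                      /* big-endian length at the end */
        tail[nblk * BLOCK_SIZE - 1 - i] = (uint8_t)(bits >> (8 * i));

    hash_blocks(state, tail, nblk);
}
```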