Interaction between fork and user-space memory mapped in the kernel
Consider a Linux driver that uses `get_user_pages` (or `get_page`) to map pages from the calling process. The physical addresses of the pages are then passed to a hardware device. Both the process and the device may read and write to the pages until the parties decide to end the communication. In particular, the communication may continue using the pages after the system call that calls `get_user_pages` returns. The system call is in effect setting up a shared memory zone between the process and the hardware device.
I'm concerned about what happens if the process calls `fork` (it could be from another thread, and could happen either while the syscall that calls `get_user_pages` is in progress or later). In particular, if the parent writes to the shared memory area after the fork, what do I know about the underlying physical address (presumably changed due to copy-on-write)? I want to understand:
1. what the kernel needs to do to defend against a potentially misbehaving process (I don't want to create a security hole!);
2. what restrictions the process needs to obey so that the functionality of our driver works correctly (i.e. the physical memory remains mapped at the same address in the parent process).
   - Ideally, I would like the common case where the child process doesn't use our driver at all (it probably calls `exec` almost immediately) to work.
   - Ideally, the parent process should not have to take any special steps when allocating the memory, as we have existing code that passes a stack-allocated buffer to the driver.
   - I'm aware of `madvise` with `MADV_DONTFORK`, and it would be ok to have the memory disappear from the child process's space, but it's not applicable to a stack-allocated buffer.
   - “Don't use fork while you have a connection active with our driver” would be annoying, but acceptable as a last resort if point 1 is satisfied.
I'm willing to be pointed to documentation or source code. I've looked in particular at Linux Device Drivers, but didn't find this issue addressed. RTFS applied to even just the relevant part of the kernel source is a bit overwhelming.
The kernel version is not completely fixed but is a recent one (let's say ≥2.6.26). We're only targeting Arm platforms (single-processor so far, but multicore is just round the corner), if it matters.
Comments (2)
A `fork()` will not interfere with `get_user_pages()`: `get_user_pages()` will give you a `struct page`.
You would need to `kmap()` it before being able to access it, and this mapping is done in kernel space, not userspace.
EDIT: `get_user_pages()` touches the page table, but you should not be worried about this (it just makes sure that the pages are mapped in userspace), and it returns -EFAULT if it had any problem doing so.
If you fork(), until copy-on-write is performed, the child will be able to see that page.
Once copy-on-write is done (because the child/the driver/the parent wrote to the page through the userspace mapping -- not the kernel kmap() the driver has), that page will no longer be shared. If you still hold a kmap() on the page (in the driver code), you will not be able to know if you are holding the parent page or the child's.
1) It's not a security hole, because once you execve(), all of that is gone.
2) When you fork() you want both processes to be identical (it's a fork!). I would think that your design should allow both the parent and the child to access the driver. execve() will flush everything.
What about adding some functionality in userspace like:
When mmap() is called on your device, you install a memory mapping, with special flags:
http://os1a.cs.columbia.edu/lxr/source/include/linux/mm.h#071
You have some interesting things like:
VM_SHARED will disable copy on write
VM_LOCKED will disable swapping on that page
VM_DONTCOPY will tell the kernel not to copy the vma region on fork, although I don't think it's a good idea
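
The difference between a copy-on-write mapping and a shared one can be observed from userspace. The sketch below is only an analogy for the driver case: it uses anonymous `mmap()` with `MAP_SHARED` or `MAP_PRIVATE` rather than a device mapping, and `child_write_visible` is a hypothetical helper name. It forks, lets the child write to the page, and reports whether the parent sees the write.

```c
#define _DEFAULT_SOURCE 1
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if a write made by a forked child is
 * visible to the parent, 0 if it is not, -1 on error.
 * map_flags is expected to be MAP_SHARED or MAP_PRIVATE. */
int child_write_visible(int map_flags)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    void *mem = mmap(NULL, page, PROT_READ | PROT_WRITE,
                     map_flags | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return -1;

    volatile unsigned char *p = mem;
    p[0] = 0;                      /* fault the page in before forking */

    pid_t pid = fork();
    if (pid < 0) {
        munmap(mem, page);
        return -1;
    }
    if (pid == 0) {                /* child: write, then leave */
        p[0] = 1;
        _exit(0);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0) {
        munmap(mem, page);
        return -1;
    }

    int seen = p[0];               /* 1 only if the page was really shared */
    munmap(mem, page);
    return seen;
}
```

With `MAP_SHARED` the child and parent keep using the same page, so the write is visible; with `MAP_PRIVATE` the child's write triggers copy-on-write and the parent still reads 0.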
The short answer is to use `madvise(addr, len, MADV_DONTFORK)` on any userspace buffers you give to your driver. This tells the kernel that the mapping should not be copied from parent to child and so there is no CoW.
The drawback is that the child inherits no mapping at that address, so if you want the child to then start using the driver it will need to remap that memory. But that is fairly easy to do in userspace.
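
As a rough illustration of that call, here is a minimal sketch. The helper names `mark_dontfork` and `mapped_in_child` are made up for this example. `madvise()` needs a page-aligned start address, so the first helper rounds the range out to whole pages; the second forks and uses `mincore()` in the child to check whether the address is still mapped there.

```c
#define _DEFAULT_SOURCE 1
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: apply MADV_DONTFORK to every page overlapping
 * [addr, addr + len). Note that anything else living in those pages
 * also disappears from the child. */
int mark_dontfork(void *addr, size_t len)
{
    uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE);
    uintptr_t start = (uintptr_t)addr & ~(page - 1);
    uintptr_t end = ((uintptr_t)addr + len + page - 1) & ~(page - 1);
    return madvise((void *)start, (size_t)(end - start), MADV_DONTFORK);
}

/* Hypothetical helper: fork and report whether the child has a mapping
 * at addr. Returns 1 if mapped, 0 if not mapped, -1 on error. */
int mapped_in_child(void *addr)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE);
        void *start = (void *)((uintptr_t)addr & ~(page - 1));
        unsigned char vec;
        /* mincore() fails with ENOMEM when the range is not mapped. */
        int rc = mincore(start, (size_t)page, &vec);
        _exit(rc == 0 ? 1 : (errno == ENOMEM ? 0 : 2));
    }

    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    int code = WEXITSTATUS(status);
    return code <= 1 ? code : -1;
}
```

Because the advice applies to whole pages, this is best used on a buffer that has its own dedicated mapping rather than on memory carved out of the general heap.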
Update: A buffer on the stack is problematic; I'm not sure you can make it safe in general.
You can't mark it `DONTFORK`, because your child might be running on that stack page when it forks, or (worse in a way) it might do a function return later and hit the unmapped stack page. (I even tested this: you can happily mark your stack `DONTFORK`, but bad things happen when you fork.)
The other way to avoid a CoW is to create a shared mapping, but you can't map your stack shared for obvious reasons.
That means you risk a CoW if you fork. Even if the child "just" execs it might still touch the stack page and cause a CoW, leading to the parent getting a different page, which is bad.
The one minor point in your favor is that code using an on-stack buffer only needs to worry about forks in the code it calls, i.e. you can't use an on-stack buffer after the function has returned. So you only need to audit your callees, and if they never fork you're safe, but that still may be infeasible, and is fragile if the code ever changes.
I think you really want all memory that is given to your driver to come from a custom allocator in userspace. It shouldn't be that intrusive. The allocator can either `mmap` your device directly, as the other answer suggested, or just use anonymous `mmap`, `madvise(DONTFORK)`, and probably `mlock()` to avoid swap out.
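
The anonymous-mapping variant of such an allocator might look roughly like the sketch below. The names `drvbuf_alloc` and `drvbuf_free` are invented for illustration, and treating an `mlock()` failure as non-fatal is an assumption of this sketch (locking can fail under `RLIMIT_MEMLOCK`); a real allocator may prefer to fail instead.

```c
#define _DEFAULT_SOURCE 1
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Round a length up to a whole number of pages. */
static size_t round_to_pages(size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    return (len + page - 1) / page * page;
}

/* Hypothetical allocator for buffers that will be handed to the driver.
 * Each buffer gets its own anonymous mapping, so MADV_DONTFORK affects
 * nothing else. Returns NULL on failure. */
void *drvbuf_alloc(size_t len)
{
    assert(len > 0);
    size_t size = round_to_pages(len);

    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    /* Keep the mapping out of any child, so fork cannot cause a CoW. */
    if (madvise(p, size, MADV_DONTFORK) != 0) {
        munmap(p, size);
        return NULL;
    }

    /* Best effort in this sketch: try to keep the pages from being
     * swapped out, but carry on if the lock limit is exceeded. */
    (void)mlock(p, size);
    return p;
}

/* Release a buffer obtained from drvbuf_alloc(); len is the length that
 * was passed to drvbuf_alloc(). munmap() also drops the memory lock. */
int drvbuf_free(void *p, size_t len)
{
    return munmap(p, round_to_pages(len));
}
```

Existing code that passes a stack buffer would then call `drvbuf_alloc()` for the buffer it gives to the driver and `drvbuf_free()` once the connection ends.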