Here is what I have used, in brief:

- `get_user_pages` to pin the user page(s) and give you an array of `struct page *` pointers.
- `dma_map_page` on each `struct page *` to get the DMA address (aka "I/O address") for the page. This also creates an IOMMU mapping (if needed on your platform).
- Now tell your device to perform the DMA into the memory using those DMA addresses. Obviously they can be non-contiguous; memory is only guaranteed to be contiguous in multiples of the page size.
- `dma_sync_single_for_cpu` to do any necessary cache flushes or bounce buffer blitting or whatever. This call guarantees that the CPU can actually see the result of the DMA, since on many systems, modifying physical RAM behind the CPU's back results in stale caches.
- `dma_unmap_page` to free the IOMMU mapping (if it was needed on your platform).
- `put_page` to un-pin the user page(s).
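For concreteness, here is a rough sketch of that sequence for a single page. It assumes roughly a recent 5.x kernel (the `get_user_pages()` signature and locking rules have changed over the years), and the step that actually programs the hardware is left as a comment because it is entirely device-specific:

```c
#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/dma-mapping.h>

/* Sketch only: 'dev' is the struct device doing the DMA. */
static int dma_one_user_page(struct device *dev, unsigned long uaddr)
{
	struct page *page;
	dma_addr_t dma;
	long pinned;
	int ret = 0;

	/* Pin the user page; FOLL_WRITE because the device will write it. */
	mmap_read_lock(current->mm);
	pinned = get_user_pages(uaddr & PAGE_MASK, 1, FOLL_WRITE, &page, NULL);
	mmap_read_unlock(current->mm);
	if (pinned < 1)
		return pinned < 0 ? pinned : -EFAULT;

	/* Get a DMA ("I/O") address; also sets up the IOMMU if needed. */
	dma = dma_map_page(dev, page, 0, PAGE_SIZE, DMA_FROM_DEVICE);
	if (dma_mapping_error(dev, dma)) {
		ret = -ENOMEM;
		goto unpin;
	}

	/* Tell the device to DMA into 'dma' here and wait for its IRQ. */

	/* Make the DMA'd data visible to the CPU, then tear it all down. */
	dma_sync_single_for_cpu(dev, dma, PAGE_SIZE, DMA_FROM_DEVICE);
	dma_unmap_page(dev, dma, PAGE_SIZE, DMA_FROM_DEVICE);
	set_page_dirty_lock(page);	/* the page's contents were changed */
unpin:
	put_page(page);
	return ret;
}
```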
Note that you must check for errors all the way through here, because there are limited resources all over the place. `get_user_pages` returns a negative number for an outright error (-errno), but it can return a positive number to tell you how many pages it actually managed to pin (physical memory is not limitless). If this is less than you requested, you still must loop through all of the pages it did pin in order to call `put_page` on them. (Otherwise you are leaking kernel memory; very bad.) `dma_map_page` can also return an error (-errno), because IOMMU mappings are another limited resource.

`dma_unmap_page` and `put_page` return `void`, as usual for Linux "freeing" functions. (Linux kernel resource-management routines only return errors because something actually went wrong, not because you screwed up and passed a bad pointer or something. The basic assumption is that you are never screwing up, because this is kernel code. Although `get_user_pages` does check to ensure the validity of the user addresses and will return an error if the user handed you a bad pointer.)

You can also consider using the `_sg` functions if you want a friendly interface to scatter/gather. Then you would call `dma_map_sg` instead of `dma_map_page`, `dma_sync_sg_for_cpu` instead of `dma_sync_single_for_cpu`, etc.

Also note that many of these functions may be more-or-less no-ops on your platform, so you can often get away with being sloppy. (In particular, `dma_sync_...` and `dma_unmap_...` do nothing on my x86_64 system.) But on those platforms, the calls themselves get compiled into nothing, so there is no excuse for being sloppy.
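For the scatter/gather flavour, a rough sketch might look like this, assuming the `pages` array came from `get_user_pages()` as above and that your kernel has `sg_alloc_table_from_pages()`:

```c
#include <linux/scatterlist.h>
#include <linux/dma-mapping.h>

/* Map an array of already-pinned pages as one scatter/gather list. */
static int map_pages_sg(struct device *dev, struct page **pages, int npages,
			struct sg_table *sgt)
{
	int nents, ret;

	/* Build a scatterlist covering the whole pages (offset 0). */
	ret = sg_alloc_table_from_pages(sgt, pages, npages, 0,
					(size_t)npages * PAGE_SIZE, GFP_KERNEL);
	if (ret)
		return ret;

	/* One call maps the whole list; nents may be smaller than npages
	 * if the IOMMU merged adjacent entries. */
	nents = dma_map_sg(dev, sgt->sgl, sgt->orig_nents, DMA_FROM_DEVICE);
	if (nents == 0) {
		sg_free_table(sgt);
		return -ENOMEM;
	}

	/* Hand the (sg_dma_address(sg), sg_dma_len(sg)) pairs to the device.
	 * Later: dma_sync_sg_for_cpu(), dma_unmap_sg(), sg_free_table(). */
	return nents;
}
```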
OK, this is what I did.
Disclaimer: I'm a hacker in the pure sense of the word and my code ain't the prettiest.
I read LDD3 and the Infiniband source code and other predecessor stuff and decided that `get_user_pages` and pinning them and all that other rigmarole was just too painful to contemplate while hungover. Also, I was working with another person across the PCIe bus, and I was also responsible for "designing" the userspace application.

I wrote the driver such that at load time it preallocates as many buffers as it can, of the largest size, by calling `myAddr[i] = pci_alloc_consistent(blah, size, &pci_addr[i])` until that fails (failure -> `myAddr[i]` is `NULL`, I think; I forget). I was able to allocate around 2.5 GB of buffers, each 4 MiB in size, on my meagre machine which only has 4 GiB of memory. The total number of buffers varies depending on when the kernel module is loaded, of course; load the driver at boot time and the most buffers get allocated. Each individual buffer's size maxed out at 4 MiB on my system; not sure why. I `cat`ted `/proc/buddyinfo` to make sure I wasn't doing anything stupid, which is of course my usual starting pattern.
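For illustration, the preallocation loop could look roughly like this; `MAX_BUFS` and `BUF_SIZE` are made-up names, and `pci_alloc_consistent()` is the legacy wrapper the answer mentions (newer kernels spell it `dma_alloc_coherent()`):

```c
#include <linux/pci.h>

#define MAX_BUFS 1024			/* hypothetical upper bound */
#define BUF_SIZE (4 * 1024 * 1024)	/* 4 MiB per buffer, as in the text */

static void      *myAddr[MAX_BUFS];
static dma_addr_t pci_addr[MAX_BUFS];
static int        nbufs;

static void prealloc_buffers(struct pci_dev *pdev)
{
	int i;

	/* Keep grabbing coherent 4 MiB buffers until the allocator says no. */
	for (i = 0; i < MAX_BUFS; i++) {
		myAddr[i] = pci_alloc_consistent(pdev, BUF_SIZE, &pci_addr[i]);
		if (!myAddr[i])		/* failure -> NULL: stop here */
			break;
	}
	nbufs = i;
}
```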
The driver then proceeds to give the array of `pci_addr` to the PCIe device along with their sizes. The driver then just sits there waiting for the interrupt storm to begin. Meanwhile, in userspace, the application opens the driver, queries the number of allocated buffers (n) and their sizes (using `ioctl`s or `read`s, etc.), and then proceeds to call `mmap()` multiple (n) times. Of course, `mmap()` must be properly implemented in the driver, and LDD3 pages 422-423 were handy. Userspace now has n pointers to n areas of driver memory.
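One possible shape for the driver-side `mmap()`, reusing the names from the sketch above; it uses `dma_mmap_coherent()` (LDD3's example uses `remap_pfn_range()` instead), and the buffer-index-in-the-offset convention is only an assumption for illustration:

```c
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/dma-mapping.h>

static struct pci_dev *my_pdev;	/* hypothetical: saved at probe() time */

static int my_mmap(struct file *filp, struct vm_area_struct *vma)
{
	/* Hypothetical convention: userspace picks buffer i by calling
	 * mmap(NULL, BUF_SIZE, PROT_READ, MAP_SHARED, fd, i * BUF_SIZE). */
	unsigned long idx = vma->vm_pgoff / (BUF_SIZE >> PAGE_SHIFT);
	size_t len = vma->vm_end - vma->vm_start;

	if (idx >= nbufs || len > BUF_SIZE)
		return -EINVAL;

	vma->vm_pgoff = 0;	/* the offset only served to select a buffer */
	return dma_mmap_coherent(&my_pdev->dev, vma, myAddr[idx],
				 pci_addr[idx], len);
}
```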
As the driver is interrupted by the PCIe device, it is told which buffers are "full" or "available" to be sucked dry. The application in turn is pending on a `read()` or `ioctl()` to be told which buffers are full of useful data. The tricky part was managing the userspace-to-kernel-space synchronization such that buffers which are being DMA'd into by the PCIe device are not also being modified by userspace, but that's what we get paid for. I hope this makes sense, and I'd be more than happy to be told I'm an idiot, but please tell me why.
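The "pending on a `read()`" part could look roughly like this, with a wait queue that the interrupt handler wakes; the buffer bookkeeping is hypothetical and omitted:

```c
#include <linux/fs.h>
#include <linux/wait.h>
#include <linux/atomic.h>
#include <linux/uaccess.h>

static DECLARE_WAIT_QUEUE_HEAD(full_wq);
static atomic_t nfull = ATOMIC_INIT(0);	/* the IRQ handler does atomic_inc()
					 * and wake_up_interruptible(&full_wq) */

static ssize_t my_read(struct file *filp, char __user *ubuf,
		       size_t count, loff_t *ppos)
{
	int idx;

	if (count < sizeof(idx))
		return -EINVAL;

	/* Sleep until the interrupt handler reports at least one full buffer. */
	if (wait_event_interruptible(full_wq, atomic_read(&nfull) > 0))
		return -ERESTARTSYS;

	atomic_dec(&nfull);
	idx = 0;	/* hypothetical: real code would pop a "full" index off a list */
	if (copy_to_user(ubuf, &idx, sizeof(idx)))
		return -EFAULT;
	return sizeof(idx);
}
```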
I recommend this book as well, by the way: http://www.amazon.com/Linux-Programming-Interface-System-Handbook/dp/1593272200. I wish I had that book seven years ago when I wrote my first Linux driver.
There is another type of trickery possible by adding even more memory and not letting the kernel use it, and `mmap`ping on both sides of the userspace/kernelspace divide, but the PCI device must also support DMA addressing higher than 32 bits. I haven't tried it, but I wouldn't be surprised if I'm eventually forced to.
Well, if you have LDD, you can have a look at chapter 15, and more precisely page 435, where Direct I/O operations are described.
The kernel call that will help you achieve this is `get_user_pages`. In your case, since you want to send data from the kernel to userspace, you should set the write flag to 1. Also be aware that asynchronous I/O may allow you to achieve the same result, but with your userspace application not having to wait for the read to finish, which can be better.
Take a good look at the Infiniband drivers. They go to much effort to make zero-copy DMA and RDMA to user-space work.
I forgot to add this before saving:
Doing DMA directly to user-space memory mappings is full of problems, so unless you have very high performance requirements like Infiniband or 10 Gb Ethernet, don't do it. Instead, copy the DMA'd data into the userspace buffers. It will save you much grief.
For just one example, what if the user's program exits before the DMA is complete? What if the user memory is reallocated to another process after exit but the hardware is still set to DMA into that page? Disaster!
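For illustration, the "just copy it" approach can be as simple as a `read()` handler that hands the driver's own DMA buffer to userspace with `copy_to_user()`; `dma_buf` and `dma_len` here are hypothetical driver state:

```c
#include <linux/fs.h>
#include <linux/kernel.h>
#include <linux/uaccess.h>

/* Hypothetical driver state: a kernel-owned DMA buffer (e.g. from
 * dma_alloc_coherent()) and the byte count of the last transfer. */
static void *dma_buf;
static size_t dma_len;

static ssize_t my_copying_read(struct file *filp, char __user *ubuf,
			       size_t count, loff_t *ppos)
{
	size_t n = min(count, dma_len);

	/* The device DMA'd into dma_buf; just copy the result out to the user. */
	if (copy_to_user(ubuf, dma_buf, n))
		return -EFAULT;
	return n;
}
```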
The `remap_pfn_range` function (used in the driver's `mmap` implementation) can be used to map kernel memory to user space.

A real example can be found in the mem character driver, drivers/char/mem.c.
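A minimal `mmap()` along those lines might look like this; `kbuf` and `KBUF_SIZE` are hypothetical stand-ins for a physically contiguous kernel buffer (e.g. from `kmalloc()`):

```c
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/io.h>

#define KBUF_SIZE (1 << 20)	/* hypothetical 1 MiB buffer */
static void *kbuf;		/* hypothetical, physically contiguous */

static int kbuf_mmap(struct file *filp, struct vm_area_struct *vma)
{
	size_t len = vma->vm_end - vma->vm_start;
	unsigned long pfn = virt_to_phys(kbuf) >> PAGE_SHIFT;

	if (len > KBUF_SIZE)
		return -EINVAL;

	/* Map the buffer's physical pages straight into the caller's VMA. */
	if (remap_pfn_range(vma, vma->vm_start, pfn, len, vma->vm_page_prot))
		return -EAGAIN;
	return 0;
}
```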