Linux kernel device driver to DMA from a device into user-space memory
I want to get data from a DMA-enabled PCIe hardware device into user-space as quickly as possible.
Q: How do I combine "direct I/O to user-space with/and/via a DMA transfer"?
Reading through LDD3, it seems that I need to perform a few different types of IO operations!?
dma_alloc_coherent gives me the physical address that I can pass to the hardware device. But I would need to set up get_user_pages and perform a copy_to_user type call when the transfer completes. This seems a waste, asking the device to DMA into kernel memory (acting as a buffer) and then transferring it again to user-space.
LDD3 p453: /* Only now is it safe to access the buffer, copy to user, etc. */
What I ideally want is some memory that:
- I can use in user-space (maybe request the driver via an ioctl call to create DMA'able memory/buffer?)
- I can get a physical address from to pass to the device so that all user-space has to do is perform a read on the driver
- the read method would activate the DMA transfer, block waiting for the DMA complete interrupt and release the user-space read afterwards (user-space is now safe to use/read memory).
Do I need single-page streaming mappings, i.e. set up the mapping and have the user-space buffers mapped with get_user_pages and dma_map_page?
My code so far sets up get_user_pages at the given address from user-space (I call this the Direct I/O part). Then, dma_map_page with a page from get_user_pages. I give the device the return value from dma_map_page as the DMA physical transfer address.
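For illustration only (not the asker's actual code), a minimal sketch of that sequence for a single, page-aligned user address; get_user_pages_fast() stands in for whichever get_user_pages variant the kernel version provides:

    #include <linux/mm.h>
    #include <linux/dma-mapping.h>

    /* Pin one user page, map it for DMA, and return the bus address that gets
     * programmed into the device.  'dev' is the PCIe function's struct device
     * (&pdev->dev); 'uaddr' is a page-aligned user-space address. */
    static dma_addr_t map_one_user_page(struct device *dev, unsigned long uaddr,
                                        struct page **page)
    {
        dma_addr_t bus;

        if (get_user_pages_fast(uaddr, 1, FOLL_WRITE, page) != 1)
            return 0;                           /* could not pin the page */

        bus = dma_map_page(dev, *page, 0, PAGE_SIZE, DMA_FROM_DEVICE);
        if (dma_mapping_error(dev, bus)) {
            put_page(*page);
            return 0;
        }
        return bus;     /* the "DMA physical transfer address" handed to the device */
    }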
I am using some kernel modules as reference: drivers_scsi_st.c and drivers-net-sh_eth.c. I would look at the infiniband code, but can't find which one is the most basic!
Many thanks in advance.
I'm actually working on exactly the same thing right now and I'm going the ioctl() route. The general idea is for user space to allocate the buffer which will be used for the DMA transfer, and an ioctl() will be used to pass the size and address of this buffer to the device driver. The driver will then use scatter-gather lists along with the streaming DMA API to transfer data directly to and from the device and the user-space buffer.
The implementation strategy I'm using is that the ioctl() in the driver enters a loop that DMAs the user-space buffer in chunks of 256k (which is the hardware-imposed limit on how many scatter/gather entries it can handle). This is isolated inside a function that blocks until each transfer is complete (see below). When all bytes are transferred or the incremental transfer function returns an error, the ioctl() exits and returns to user space.
Pseudo code for the ioctl():
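(The answer's original pseudo code is not reproduced here; the sketch below only illustrates the loop described above. struct my_dma_req, struct my_dev and dma_xfer_chunk() are hypothetical names.)

    #include <linux/fs.h>
    #include <linux/uaccess.h>
    #include <linux/completion.h>
    #include <linux/kernel.h>

    struct my_dma_req {                 /* hypothetical ioctl argument */
        char __user *buf;               /* user-space buffer */
        size_t len;                     /* total number of bytes to transfer */
    };

    struct my_dev {                     /* hypothetical per-device state */
        struct device *dmadev;          /* &pdev->dev, used with the DMA API */
        struct completion dma_done;     /* completed by the interrupt handler */
    };

    /* One blocking scatter/gather transfer of up to 256k (sketched below). */
    static int dma_xfer_chunk(struct my_dev *dev, char __user *ubuf, size_t len);

    static long my_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
    {
        struct my_dev *dev = filp->private_data;
        struct my_dma_req req;
        size_t done = 0;
        int ret = 0;

        if (copy_from_user(&req, (void __user *)arg, sizeof(req)))
            return -EFAULT;

        /* DMA the user buffer in 256k chunks; each call blocks until the
         * hardware signals completion for that chunk. */
        while (done < req.len) {
            size_t chunk = min_t(size_t, req.len - done, 256 * 1024);

            ret = dma_xfer_chunk(dev, req.buf + done, chunk);
            if (ret)
                break;                  /* incremental transfer reported an error */
            done += chunk;
        }
        return ret;
    }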
Pseudo code for the incremental transfer function:
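(Again a sketch rather than the original pseudo code: pin the user pages for one chunk, build a scatterlist, map it, start the hardware and sleep until the completion interrupt. start_hw_dma() is a hypothetical routine that programs the scatter/gather entries into the device, and the exact get_user_pages variant depends on the kernel version.)

    #include <linux/mm.h>
    #include <linux/slab.h>
    #include <linux/scatterlist.h>
    #include <linux/dma-mapping.h>

    static int dma_xfer_chunk(struct my_dev *dev, char __user *ubuf, size_t len)
    {
        unsigned long uaddr = (unsigned long)ubuf;
        unsigned int offset = uaddr & (PAGE_SIZE - 1);          /* offset into first page */
        int nr_pages = (offset + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
        struct page **pages;
        struct scatterlist *sgl;
        int i, got = 0, mapped, ret = 0;

        pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
        sgl   = kcalloc(nr_pages, sizeof(*sgl), GFP_KERNEL);
        if (!pages || !sgl) {
            ret = -ENOMEM;
            goto out_free;
        }

        got = get_user_pages_fast(uaddr, nr_pages, FOLL_WRITE, pages);
        if (got != nr_pages) {
            ret = -EFAULT;
            goto out_put;
        }

        sg_init_table(sgl, nr_pages);
        for (i = 0; i < nr_pages; i++) {
            unsigned int plen = min_t(size_t, len, PAGE_SIZE - offset);

            sg_set_page(&sgl[i], pages[i], plen, offset);
            len -= plen;
            offset = 0;                 /* only the first page has an offset */
        }

        mapped = dma_map_sg(dev->dmadev, sgl, nr_pages, DMA_FROM_DEVICE);
        if (!mapped) {
            ret = -EIO;
            goto out_put;
        }

        reinit_completion(&dev->dma_done);      /* init_completion() done at probe time */
        start_hw_dma(dev, sgl, mapped);         /* hypothetical: program SG entries, kick DMA */
        wait_for_completion(&dev->dma_done);    /* woken by the interrupt handler below */

        dma_unmap_sg(dev->dmadev, sgl, nr_pages, DMA_FROM_DEVICE);
    out_put:
        for (i = 0; i < got; i++)
            put_page(pages[i]);
    out_free:
        kfree(sgl);
        kfree(pages);
        return ret;
    }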
The interrupt handler is exceptionally brief:
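(A matching sketch; ack_hw_irq() is a hypothetical register write that acknowledges the device.)

    #include <linux/interrupt.h>

    static irqreturn_t my_irq_handler(int irq, void *data)
    {
        struct my_dev *dev = data;

        ack_hw_irq(dev);                /* hypothetical: clear the DMA-done interrupt */
        complete(&dev->dma_done);       /* unblock wait_for_completion() above */
        return IRQ_HANDLED;
    }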
Please note that this is just a general approach; I've been working on this driver for the last few weeks and have yet to actually test it... So please don't treat this pseudo code as gospel, and be sure to double-check all logic and parameters ;-).
You basically have the right idea: in 2.1, you can just have userspace allocate any old memory. You do want it page-aligned, so posix_memalign() is a handy API to use.
Then have userspace pass in the userspace virtual address and size of this buffer somehow; ioctl() is a good quick and dirty way to do this. In the kernel, allocate an appropriately sized buffer array of struct page* -- user_buf_size/PAGE_SIZE entries -- and use get_user_pages() to get a list of struct page* for the userspace buffer.
Once you have that, you can allocate an array of struct scatterlist that is the same size as your page array and loop through the list of pages doing sg_set_page(). After the sg list is set up, you do dma_map_sg() on the array of scatterlist, and then you can get the sg_dma_address and sg_dma_len for each entry in the scatterlist (note you have to use the return value of dma_map_sg() because you may end up with fewer mapped entries, since things might get merged by the DMA mapping code).
That gives you all the bus addresses to pass to your device, and then you can trigger the DMA and wait for it however you want. The read()-based scheme you have is probably fine.
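As an illustration of that last point (a sketch, not code from the answer), the loop over the mapped entries has to use the count returned by dma_map_sg(); write_hw_sg_entry() is a hypothetical device-specific helper:

    #include <linux/scatterlist.h>
    #include <linux/dma-mapping.h>

    /* 'sgl' has already been filled with sg_set_page() as described above. */
    static int program_sg_list(struct device *dev, struct scatterlist *sgl,
                               int npages)
    {
        struct scatterlist *sg;
        int i, nents;

        nents = dma_map_sg(dev, sgl, npages, DMA_FROM_DEVICE);
        if (!nents)
            return -EIO;

        /* Iterate over the *returned* count: entries may have been merged. */
        for_each_sg(sgl, sg, nents, i)
            write_hw_sg_entry(i, sg_dma_address(sg), sg_dma_len(sg));

        return nents;
    }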
You can refer to drivers/infiniband/core/umem.c, specifically ib_umem_get(), for some code that builds up this mapping, although the generality that that code needs to deal with may make it a bit confusing.
Alternatively, if your device doesn't handle scatter/gather lists too well and you want contiguous memory, you could use get_free_pages() to allocate a physically contiguous buffer and use dma_map_page() on that. To give userspace access to that memory, your driver just needs to implement an mmap method instead of the ioctl as described above.
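A rough sketch of that alternative (the kernel spells the allocation call __get_free_pages(); the helper name below is made up):

    #include <linux/gfp.h>
    #include <linux/mm.h>
    #include <linux/dma-mapping.h>

    /* Allocate a physically contiguous buffer and map it for DMA in one go.
     * Returns the bus address to hand to the device, or 0 on failure. */
    static dma_addr_t alloc_contig_dma_buf(struct device *dev, size_t size,
                                           void **vaddr)
    {
        unsigned int order = get_order(size);
        unsigned long buf = __get_free_pages(GFP_KERNEL, order);
        dma_addr_t bus;

        if (!buf)
            return 0;

        bus = dma_map_page(dev, virt_to_page(buf), 0, size, DMA_FROM_DEVICE);
        if (dma_mapping_error(dev, bus)) {
            free_pages(buf, order);
            return 0;
        }

        *vaddr = (void *)buf;
        return bus;
    }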
At some point I wanted to allow a user-space application to allocate DMA buffers, get them mapped to user space, and get the physical address, so that it could control my device and do DMA transactions (bus mastering) entirely from user space, totally bypassing the Linux kernel. I used a slightly different approach, though. First I started with a minimal kernel module that initialized/probed the PCIe device and created a character device. That driver then allowed a user-space application to do two things:
- map the I/O bar of the PCIe device into user space using the remap_pfn_range() function;
- allocate DMA buffers and map them into user space as well.
Basically, it boils down to a custom implementation of the mmap() call (through file_operations). The one for the I/O bar is easy, and the other one, which allocates DMA buffers using pci_alloc_consistent(), is a little bit more complicated.
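The answer's original code is not included here; the sketch below shows what the two mmap() handlers might look like. my_pdev, the BAR number, and the way the bus address gets reported back to user space are all assumptions; pci_alloc_consistent() is the older API named in the answer (dma_alloc_coherent() is its modern equivalent).

    #include <linux/pci.h>
    #include <linux/mm.h>
    #include <linux/io.h>

    static struct pci_dev *my_pdev;     /* hypothetical: saved in the probe routine */

    /* mmap() handler for the I/O bar: remap BAR0's physical pages into the
     * calling process, uncached. */
    static int my_mmap_iobar(struct file *filp, struct vm_area_struct *vma)
    {
        unsigned long size = vma->vm_end - vma->vm_start;

        if (size > pci_resource_len(my_pdev, 0))
            return -EINVAL;

        vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
        return remap_pfn_range(vma, vma->vm_start,
                               pci_resource_start(my_pdev, 0) >> PAGE_SHIFT,
                               size, vma->vm_page_prot);
    }

    /* mmap() handler that allocates a DMA buffer with pci_alloc_consistent()
     * and maps it into the process.  The bus address (dma_handle) still has
     * to be reported to user space somehow, e.g. through an ioctl. */
    static int my_mmap_dmabuf(struct file *filp, struct vm_area_struct *vma)
    {
        unsigned long size = vma->vm_end - vma->vm_start;
        dma_addr_t dma_handle;
        void *cpu_addr = pci_alloc_consistent(my_pdev, size, &dma_handle);

        if (!cpu_addr)
            return -ENOMEM;

        return remap_pfn_range(vma, vma->vm_start,
                               virt_to_phys(cpu_addr) >> PAGE_SHIFT,
                               size, vma->vm_page_prot);
    }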
Once those are in place, the user-space application can pretty much do everything: control the device by reading/writing from/to I/O registers, allocate and free DMA buffers of arbitrary size, and have the device perform DMA transactions. The only missing part is interrupt handling. I was doing polling in user space, burning my CPU, and had interrupts disabled.
Hope it helps. Good Luck!
Consider the application when designing a driver.
What is the nature of data movement, frequency, size and what else might be going on in the system?
Is the traditional read/write API sufficient?
Is direct mapping the device into user space OK?
Is a reflective (semi-coherent) shared memory desirable?
Manually manipulating data (read/write) is a pretty good option if the data lends itself to being well understood. Using general-purpose VM and read/write may be sufficient with an inline copy. Directly mapping non-cacheable accesses to the peripheral is convenient, but can be clumsy. If the access is the relatively infrequent movement of large blocks, it may make sense to use regular memory, have the driver pin the pages, translate addresses, DMA, and release the pages. As an optimization, the pages (maybe huge) can be pre-pinned and translated; the driver can then recognize the prepared memory and avoid the complexities of dynamic translation. If there are lots of little I/O operations, having the driver run asynchronously makes sense. If elegance is important, the VM dirty-page flag can be used to automatically identify what needs to be moved, and a (meta_sync()) call can be used to flush pages. Perhaps a mixture of the above works...
Too often people don't look at the larger problem before digging into the details. Often the simplest solutions are sufficient. A little effort constructing a behavioral model can help guide which API is preferable.
It seems wrong. It should be either:
or
It is worth mentioning that a driver with scatter-gather DMA support and user-space memory allocation is the most efficient and has the highest performance. However, in case we don't need high performance, or we want to develop a driver under some simplified conditions, we can use some tricks.
Give up the zero-copy design. It is worth considering when the data throughput is not too big. In such a design, data can be copied to the user by
copy_to_user(user_buffer, kernel_dma_buffer, count);
where user_buffer might be, for example, the buffer argument in a character device read() system call implementation. We still need to take care of the kernel_dma_buffer allocation. It might be memory obtained from a dma_alloc_coherent() call, for example.
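A minimal sketch of such a read() method, assuming kernel_dma_buffer was allocated with dma_alloc_coherent() at probe time and the device has already DMA'd into it (struct my_dev and its fields are hypothetical):

    #include <linux/fs.h>
    #include <linux/uaccess.h>

    struct my_dev {                     /* hypothetical per-device state */
        void  *kernel_dma_buffer;       /* from dma_alloc_coherent() at probe time */
        size_t dma_buf_size;
    };

    static ssize_t my_read(struct file *filp, char __user *user_buffer,
                           size_t count, loff_t *ppos)
    {
        struct my_dev *dev = filp->private_data;

        if (count > dev->dma_buf_size)
            count = dev->dma_buf_size;

        /* a real driver would start the DMA and wait for completion here */

        if (copy_to_user(user_buffer, dev->kernel_dma_buffer, count))
            return -EFAULT;

        return count;
    }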
Another trick is to limit system memory at boot time and then use the rest as a huge contiguous DMA buffer. It is especially useful during driver and FPGA DMA controller development, and rather not recommended in production environments. Let's say a PC has 32GB of RAM. If we add mem=20GB to the kernel boot parameter list, we can use the remaining 12GB as a huge contiguous DMA buffer. Of course, this 12GB is completely ignored by the OS and can only be used by a process that has mapped it into its address space. (We can try to avoid this by using the Contiguous Memory Allocator, CMA.) To map this memory to user space, simply implement mmap().
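A sketch of such an mmap() handler, assuming the hidden region starts right at the 20GB boundary (RESERVED_PHYS_BASE is a made-up constant for this example):

    #include <linux/mm.h>

    #define RESERVED_PHYS_BASE  (20ULL * 1024 * 1024 * 1024)    /* just past mem=20GB */

    static int my_mmap(struct file *filp, struct vm_area_struct *vma)
    {
        unsigned long size = vma->vm_end - vma->vm_start;

        /* map (part of) the hidden 12GB straight into the process */
        return remap_pfn_range(vma, vma->vm_start,
                               (unsigned long)(RESERVED_PHYS_BASE >> PAGE_SHIFT) +
                                   vma->vm_pgoff,
                               size, vma->vm_page_prot);
    }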
Again, the above tricks will not replace a full scatter-gather, zero-copy DMA driver, but they are useful during development or on some lower-performance platforms.