Memory access after ioremap is very slow
I'm working on a Linux kernel driver that makes a chunk of physical memory available to user space. I have a working version of the driver, but it's currently very slow. So, I've gone back a few steps and tried making a small, simple driver to recreate the problem.
I reserve the memory at boot time using the kernel parameter memmap=2G$1G. Then, in the driver's __init function, I ioremap some of this memory, and initialize it to a known value. I put in some code to measure the timing as well:
#include <linux/module.h>
#include <linux/init.h>
#include <linux/io.h>
#include <linux/time.h>

#define RESERVED_REGION_SIZE   (1 * 1024 * 1024 * 1024)   // 1GB
#define RESERVED_REGION_OFFSET (1 * 1024 * 1024 * 1024)   // 1GB (start of the memmap-reserved block)

static void __iomem *reservedBlock;   // declared elsewhere in the original driver

// Helper assumed by the original post: elapsed time in microseconds
static int usec_diff(struct timeval *t2, struct timeval *t1)
{
    return (t2->tv_sec - t1->tv_sec) * 1000000 + (t2->tv_usec - t1->tv_usec);
}

static int __init memdrv_init(void)
{
    struct timeval t1, t2;

    printk(KERN_INFO "[memdriver] init\n");

    // Remap reserved physical memory (that we grabbed at boot time)
    do_gettimeofday(&t1);
    reservedBlock = ioremap(RESERVED_REGION_OFFSET, RESERVED_REGION_SIZE);
    do_gettimeofday(&t2);
    printk(KERN_ERR "[memdriver] ioremap() took %d usec\n", usec_diff(&t2, &t1));

    // Set the memory to a known value
    do_gettimeofday(&t1);
    memset(reservedBlock, 0xAB, RESERVED_REGION_SIZE);
    do_gettimeofday(&t2);
    printk(KERN_ERR "[memdriver] memset() took %d usec\n", usec_diff(&t2, &t1));

    // Register the character device
    ...

    return 0;
}
I load the driver, and check dmesg. It reports:
[memdriver] init
[memdriver] ioremap() took 76268 usec
[memdriver] memset() took 12622779 usec
That's 12.6 seconds for the memset. That means the memset is running at 81 MB/sec. Why on earth is it so slow?
This is kernel 2.6.34 on Fedora 13, and it's an x86_64 system.
EDIT:
The goal behind this scheme is to take a chunk of physical memory and make it available to both a PCI device (via the memory's bus/physical address) and a user space application (via a call to mmap, supported by the driver). The PCI device will then continually fill this memory with data, and the user-space app will read it out. If ioremap is a bad way to do this (as Ben suggested below), I'm open to other suggestions that'll allow me to get any large chunk of memory that can be directly accessed by both hardware and software. I can probably make do with a smaller buffer also.
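A minimal sketch of the mmap-backed path described here, assuming a character device registered by the driver; the memdrv_mmap/memdrv_fops names and the unconditionally non-cached page protection are illustrative choices, not the actual driver:

#include <linux/module.h>
#include <linux/fs.h>
#include <linux/mm.h>

/* Illustrative mmap handler: expose the memmap-reserved physical region
 * (starting at RESERVED_REGION_OFFSET) directly to user space. */
static int memdrv_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long size = vma->vm_end - vma->vm_start;

    if (size > RESERVED_REGION_SIZE)
        return -EINVAL;

    /* Keep the user mapping uncached so it matches how the device writes;
     * drop this line if the region ends up being mapped cacheable. */
    vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

    return remap_pfn_range(vma, vma->vm_start,
                           RESERVED_REGION_OFFSET >> PAGE_SHIFT,
                           size, vma->vm_page_prot);
}

static const struct file_operations memdrv_fops = {
    .owner = THIS_MODULE,
    .mmap  = memdrv_mmap,
};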
See my eventual solution below.
4 Answers
ioremap allocates uncacheable pages, as you'd desire for access to a memory-mapped-io device. That would explain your poor performance.

You probably want kmalloc or vmalloc. The usual reference materials will explain the capabilities of each.
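A minimal sketch of the two allocators this answer points to (the alloc_buffers helper and the sizes are illustrative, not from the answer):

#include <linux/slab.h>
#include <linux/vmalloc.h>

/* Illustrative buffers and sizes only -- kmalloc() cannot come close to
 * satisfying the 1 GB region from the question. */
static void *kbuf;
static void *vbuf;

static int alloc_buffers(void)
{
    /* kmalloc: physically contiguous and cached, but limited to small
     * allocations (roughly a few MB at most). */
    kbuf = kmalloc(1 * 1024 * 1024, GFP_KERNEL);

    /* vmalloc: cached and can be much larger, but only virtually
     * contiguous -- the underlying pages are scattered, which matters
     * if a device has to DMA into the buffer. */
    vbuf = vmalloc(64 * 1024 * 1024);

    if (!kbuf || !vbuf) {
        kfree(kbuf);   /* kfree()/vfree() accept NULL */
        vfree(vbuf);
        return -ENOMEM;
    }
    return 0;
}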
I don't think ioremap() is what you want there. You should only access the result (what you call reservedBlock) with readb, readl, writeb, memcpy_toio etc. It is not even guaranteed that the return is virtually mapped (although it apparently is on your platform). I'd guess that the region is being mapped uncached (suitable for IO registers), leading to the terrible performance.
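A brief sketch of what going through those accessors looks like (the touch_region helper and the values written are illustrative):

#include <linux/io.h>
#include <linux/types.h>

/* Illustrative only: exercise an ioremap()ed region via the io accessors
 * instead of plain memset() or pointer dereferences. */
static void touch_region(void __iomem *base, size_t size)
{
    u32 word;

    memset_io(base, 0xAB, size);       /* instead of memset()       */
    word = readl(base);                /* instead of *(u32 *)base   */
    writel(word ^ 0xFFFFFFFFu, base);  /* instead of a direct store */
}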
It's been a while, but I'm updating since I did eventually find a workaround for this ioremap problem.
Since we had custom hardware writing directly to the memory, it was probably more correct to mark it uncacheable, but it was unbearably slow and wasn't working for our application. Our solution was to only read from that memory (a ring buffer) once there was enough new data to fill a whole cache line on our architecture (I think that was 256 bytes). This guaranteed we never got stale data, and it was plenty fast.
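A rough sketch of that read-out policy, assuming the 256-byte cache line mentioned above; the ring_read function, its parameters, and the index handling are hypothetical, not the actual driver code:

#include <linux/types.h>

#define CACHE_LINE_BYTES 256   /* cache-line size assumed from the answer */

/* Hypothetical consumer: copy data out only in whole cache lines, so a
 * line the hardware may still be filling is never read.  Positions are
 * assumed to be kept in the range [0, ring_size). */
static size_t ring_read(const volatile u8 *ring, size_t ring_size,
                        size_t *read_pos, size_t write_pos, u8 *dst)
{
    size_t avail = (write_pos + ring_size - *read_pos) % ring_size;
    size_t whole = avail - (avail % CACHE_LINE_BYTES);   /* whole lines only */
    size_t i;

    for (i = 0; i < whole; i++)
        dst[i] = ring[(*read_pos + i) % ring_size];

    *read_pos = (*read_pos + whole) % ring_size;
    return whole;
}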
I have tried out doing huge memory chunk reservations with memmap.

The ioremapping of such a chunk gave me a mapped memory address space that is beyond a few terabytes: when you ask to reserve 128 GB of memory starting at 64 GB, /proc/vmallocinfo shows the mapped address space starting at 0xffffc9001f3a8000 (which is way too large).

Secondly, your observation is correct: even memset_io results in extremely large delays (tens of minutes) to touch all this memory. So, the time taken has to do mainly with address-space conversion and non-cacheable page loading.