Memory access after ioremap is very slow

Posted 2024-10-07 14:28:32

I'm working on a Linux kernel driver that makes a chunk of physical memory available to user space. I have a working version of the driver, but it's currently very slow. So, I've gone back a few steps and tried making a small, simple driver to recreate the problem.

I reserve the memory at boot time using the kernel parameter memmap=2G$1G. Then, in the driver's __init function, I ioremap some of this memory, and initialize it to a known value. I put in some code to measure the timing as well:

#include <linux/module.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/io.h>       // ioremap()
#include <linux/time.h>     // do_gettimeofday(), struct timeval
#include <linux/string.h>   // memset()

#define RESERVED_REGION_SIZE    (1 * 1024 * 1024 * 1024)   // 1GB
#define RESERVED_REGION_OFFSET  (1 * 1024 * 1024 * 1024)   // 1GB

// Kernel mapping of the physical region reserved at boot with memmap=2G$1G
static void *reservedBlock;

// Microsecond difference between two timeval samples (helper referenced below)
static int usec_diff(struct timeval *t2, struct timeval *t1)
{
    return (t2->tv_sec - t1->tv_sec) * 1000000 + (t2->tv_usec - t1->tv_usec);
}

static int __init memdrv_init(void)
{
    struct timeval t1, t2;
    printk(KERN_INFO "[memdriver] init\n");

    // Remap reserved physical memory (that we grabbed at boot time)
    do_gettimeofday( &t1 );
    reservedBlock = ioremap( RESERVED_REGION_OFFSET, RESERVED_REGION_SIZE );
    do_gettimeofday( &t2 );
    printk( KERN_ERR "[memdriver] ioremap() took %d usec\n", usec_diff( &t2, &t1 ) );

    // Set the memory to a known value
    do_gettimeofday( &t1 );
    memset( reservedBlock, 0xAB, RESERVED_REGION_SIZE );
    do_gettimeofday( &t2 );
    printk( KERN_ERR "[memdriver] memset() took %d usec\n", usec_diff( &t2, &t1 ) );

    // Register the character device
    ...

    return 0;
}

I load the driver, and check dmesg. It reports:

[memdriver] init
[memdriver] ioremap() took 76268 usec
[memdriver] memset() took 12622779 usec

That's 12.6 seconds for the memset. That means the memset is running at 81 MB/sec. Why on earth is it so slow?
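(Sanity check on that figure: the region is 1 GB = 1024 MB, and 1024 MB / 12.622779 s ≈ 81 MB/s, so the quoted rate follows directly from the measured time.)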

This is kernel 2.6.34 on Fedora 13, and it's an x86_64 system.

EDIT:

The goal behind this scheme is to take a chunk of physical memory and make it available to both a PCI device (via the memory's bus/physical address) and a user space application (via a call to mmap, supported by the driver). The PCI device will then continually fill this memory with data, and the user-space app will read it out. If ioremap is a bad way to do this (as Ben suggested below), I'm open to other suggestions that'll allow me to get any large chunk of memory that can be directly accessed by both hardware and software. I can probably make do with a smaller buffer also.
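For reference, here is a minimal sketch of the mmap path described above, assuming the driver simply remaps the memmap-reserved physical region into the calling process. The handler name memdrv_mmap, the file_operations wiring, and the page-protection choice are my own assumptions, not code from the original driver:

static int memdrv_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long size = vma->vm_end - vma->vm_start;

    if (size > RESERVED_REGION_SIZE)
        return -EINVAL;

    /* Map the reserved physical region into user space. For strict
     * correctness with a device writing behind the CPU's back you could
     * use pgprot_noncached(vma->vm_page_prot) here, at the cost of
     * exactly the slow accesses this question is about. */
    return remap_pfn_range(vma, vma->vm_start,
                           RESERVED_REGION_OFFSET >> PAGE_SHIFT,
                           size, vma->vm_page_prot);
}

static const struct file_operations memdrv_fops = {
    .owner = THIS_MODULE,
    .mmap  = memdrv_mmap,
};

User space would then open the character device and mmap() it to get a pointer into the same physical buffer the PCI device writes to.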


See my eventual solution below.

Comments (4)

疏忽 2024-10-14 14:28:32

ioremap allocates uncacheable pages, as you'd desire for access to a memory-mapped I/O device. That would explain your poor performance.

You probably want kmalloc or vmalloc. The usual reference materials will explain the capabilities of each.
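A minimal sketch of that suggestion, assuming an ordinary cacheable kernel buffer is what you want (the names and the size here are illustrative, not from the original driver). Note that kmalloc() can only return a few megabytes of physically contiguous memory, and vmalloc() memory is only virtually contiguous, so neither gives you a 1 GB physically contiguous DMA buffer by itself:

#include <linux/vmalloc.h>
#include <linux/string.h>
#include <linux/errno.h>

static void *buf;

static int alloc_buffer(void)
{
    /* Normal, cacheable kernel memory instead of an ioremap()ed window. */
    buf = vmalloc(16 * 1024 * 1024);        /* illustrative size */
    if (!buf)
        return -ENOMEM;
    memset(buf, 0xAB, 16 * 1024 * 1024);    /* runs at normal RAM speed */
    return 0;
}

static void free_buffer(void)
{
    vfree(buf);
}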

梦屿孤独相伴 2024-10-14 14:28:32

I don't think ioremap() is what you want there. You should only access the result (what you call reservedBlock) with readb, readl, writeb, memcpy_toio etc. It is not even guaranteed that the return is virtually mapped (although it apparently is on your platform). I'd guess that the region is being mapped uncached (suitable for IO registers) leading to the terrible performance.
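A sketch of what that looks like, treating reservedBlock as the void __iomem * that ioremap() actually returns; the function below is illustrative, not from the post:

#include <linux/kernel.h>
#include <linux/types.h>
#include <linux/io.h>

static void init_reserved_region(void __iomem *reservedBlock, size_t size)
{
    u32 first_word;

    memset_io(reservedBlock, 0xAB, size);   /* instead of memset() */
    first_word = readl(reservedBlock);      /* instead of dereferencing directly */
    printk(KERN_INFO "[memdriver] first word: 0x%08x\n", first_word);
}

As the last answer below notes, this fixes correctness but not throughput: memset_io() over a large uncached mapping is still extremely slow.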

辞取 2024-10-14 14:28:32

It's been a while, but I'm updating since I did eventually find a workaround for this ioremap problem.

Since we had custom hardware writing directly to the memory, it was probably more correct to mark it uncacheable, but it was unbearably slow and wasn't working for our application. Our solution was to only read from that memory (a ring buffer) once there was enough new data to fill a whole cache line on our architecture (I think that was 256 bytes). This guaranteed we never got stale data, and it was plenty fast.
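The post doesn't show the actual code, but a hypothetical user-space sketch of that cache-line-granular consumer could look like this. The struct, the function names, and the way the device's write position is learned are all assumptions, and the buffer size and read offset are assumed to be multiples of the cache line size:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINE_SIZE 256     /* the value quoted in the answer */

struct ring_reader {
    const uint8_t *base;        /* mmap()ed ring buffer */
    size_t         size;        /* buffer size, a multiple of CACHE_LINE_SIZE */
    size_t         read_pos;    /* bytes consumed so far (monotonic) */
};

/* write_pos is the number of bytes the PCI device has produced so far
 * (monotonic); how that is learned (device register, descriptor ring, ...)
 * is not described in the post. */
static size_t ring_consume(struct ring_reader *r, size_t write_pos,
                           uint8_t *out, size_t out_len)
{
    size_t avail = write_pos - r->read_pos;
    size_t n = (avail / CACHE_LINE_SIZE) * CACHE_LINE_SIZE;
    size_t i;

    if (n > out_len)
        n = (out_len / CACHE_LINE_SIZE) * CACHE_LINE_SIZE;

    /* Copy only complete cache lines; a partially written line is left
     * alone until the next call, which is what guarantees we never read
     * stale data through the cached mapping. */
    for (i = 0; i < n; i += CACHE_LINE_SIZE)
        memcpy(out + i, r->base + ((r->read_pos + i) % r->size),
               CACHE_LINE_SIZE);

    r->read_pos += n;
    return n;
}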

寻梦旅人 2024-10-14 14:28:32

I have tried doing huge memory-chunk reservations with memmap.

The ioremapping of this chunk gave me a mapped memory address space that is beyond a few terabytes.

When you ask to reserve 128 GB of memory starting at 64 GB, you see the following in /proc/vmallocinfo:

0xffffc9001f3a8000-0xffffc9201f3a9000 137438957568 0xffffffffa00831c9 phys=1000000000 ioremap

Thus the address space starts at 0xffffc9001f3a8000 (which is way too large).

Secondly, your observation is correct: even memset_io results in extremely large delays (tens of minutes) to touch all of this memory.

So the time taken mainly has to do with address-space conversion and non-cacheable page loading.
