spin_lock_irqsave and deadlock

Posted on 2024-12-01 07:01:38

I wrote a kernel module that calls nf_register_hook and uses a character device to pass captured packets to userspace through the device's read handler. I use a global buffer and a buffer-size variable, which is why I need to lock them whenever a new packet arrives or the user reads from my device.
I used spin_lock_irqsave and spin_unlock_irqrestore(&locker, flags), but my module deadlocks and the system freezes.

unsigned int main_hook(unsigned int hooknum, struct sk_buff *skb,
                       const struct net_device *in, const struct net_device *out,
                       int (*okfn)(struct sk_buff *)) {
    unsigned long flags;
    spin_lock_irqsave(&locker, flags);
    ...
    spin_unlock_irqrestore(&locker, flags);
}

ssize_t sniffer_dev_read(struct file *filep, char *buff, size_t count, loff_t *offp) {
    spin_lock_irqsave(&locker, flags);
    ...
    spin_unlock_irqrestore(&locker, flags);
}

main_hook is registered with nf_register_hook().
sniffer_dev_read is registered with register_chrdev().

When the user reads from the device, the system deadlocks.
Any ideas?
Or is IRQ save/restore perhaps incompatible with a netfilter hook and a character device read, so that I must use some special locking here?


Comments (3)

您的好友蓝忘机已上羡 2024-12-08 07:01:38

First, you should recompile your kernel with the lock debugging options enabled and try it again. They can help point to the cause.

There are several possible causes for a deadlock in spin_lock_irqsave. It could be recursive locking (that is, you call spin_lock_* again inside a section of code that already holds the spin lock). It could be that you are sleeping while holding the spin lock (never do this: for each function you call with the lock held, you must know whether it can sleep). It could be an AB/BA deadlock (one part of the code locks A and then B, another part locks B and then A; if the first part holds A but not yet B while the second holds B but not yet A, neither can proceed). And so on. The lock debugging options can detect many of these and warn you about them.
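To make the recursive-locking and AB/BA cases concrete, here is a minimal userspace sketch, not kernel code. It assumes a toy spinlock built on C11 `atomic_flag`; the names `spin_t`, `spin_try`, `spin_acquire`, `spin_release`, `lock_pair` and `unlock_pair` are hypothetical and exist only for this illustration. It shows one common way to rule out AB/BA deadlocks: always take the two locks in the same global order (here, lowest address first).

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy spinlock for illustration only; NOT the kernel's spinlock_t. */
typedef struct { atomic_flag held; } spin_t;

/* Returns true if the lock was free and is now ours. */
static bool spin_try(spin_t *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

/* Busy-waits until the lock is free. Calling this on a lock the same
 * thread already holds spins forever: that is the recursive-locking case. */
static void spin_acquire(spin_t *l)
{
    while (!spin_try(l)) { /* spin */ }
}

static void spin_release(spin_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Take two distinct locks in a fixed order (lowest address first), so that
 * lock_pair(&a, &b) and lock_pair(&b, &a) can never deadlock each other. */
void lock_pair(spin_t *x, spin_t *y)
{
    if ((uintptr_t)x > (uintptr_t)y) {
        spin_t *t = x;
        x = y;
        y = t;
    }
    spin_acquire(x);
    spin_acquire(y);
}

void unlock_pair(spin_t *x, spin_t *y)
{
    spin_release(x);
    spin_release(y);
}
```

While a lock is held, `spin_try` on it returns false, which is exactly the situation in which a blocking acquire from the same context would hang.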

Since what you are locking is a "global buffer and buffersize vars", try to reduce the locked area to a minimum. Instead of locking at the top of the function and unlocking at the end, do as much as possible outside the lock and lock only while manipulating your buffer. Ideally, the locked section would be just a few instructions with no function calls. It is much harder to deadlock in that case.


Now that I have said all that, here is my attempt at psychic debugging (i.e. guessing where the problem is): you are calling copy_to_user (which can sleep) with the spin lock held.
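The advice above (keep the locked region tiny, and never do anything that can sleep inside it) can be sketched as follows. This is a userspace model under stated assumptions, not the asker's module: the spinlock is a C11 `atomic_flag`, the helper names `lock`, `unlock`, `buf_push` and `buf_read` are hypothetical, and a plain `memcpy` stands in for `copy_to_user`. The point is the shape of the read path: snapshot the shared buffer into private memory under the lock, release the lock, and only then do the slow copy.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define BUF_CAP 4096

/* Stand-in for the kernel spinlock that guards the shared buffer. */
static atomic_flag locker = ATOMIC_FLAG_INIT;
static char   g_buf[BUF_CAP];   /* global packet buffer */
static size_t g_len;            /* number of bytes currently stored */

static void lock(void)
{
    while (atomic_flag_test_and_set_explicit(&locker, memory_order_acquire)) { }
}

static void unlock(void)
{
    atomic_flag_clear_explicit(&locker, memory_order_release);
}

/* Producer side (models main_hook): append under the lock.
 * Returns 1 on success, 0 if the data does not fit and is dropped. */
int buf_push(const void *data, size_t n)
{
    int ok = 0;

    lock();
    if (n <= BUF_CAP - g_len) {
        memcpy(g_buf + g_len, data, n);
        g_len += n;
        ok = 1;
    }
    unlock();
    return ok;
}

/* Consumer side (models sniffer_dev_read). Returns the bytes delivered. */
size_t buf_read(char *dst, size_t count)
{
    /* Private scratch space. In a real module this would more likely be
     * allocated before taking the lock, since kernel stacks are small. */
    char tmp[BUF_CAP];
    size_t n;

    lock();
    n = g_len < count ? g_len : count;
    memcpy(tmp, g_buf, n);                  /* snapshot the data */
    memmove(g_buf, g_buf + n, g_len - n);   /* keep the unread tail */
    g_len -= n;
    unlock();

    /* Slow step, outside the lock. In the module this is where
     * copy_to_user() would go, because it may sleep. */
    memcpy(dst, tmp, n);
    return n;
}
```

With this shape, the critical section contains only memory copies on buffers the code already owns, so nothing inside it can block.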

二货你真萌 2024-12-08 07:01:38

You should not use a spinlock to protect a resource that can be used from different context levels. A spinlock busy-waits: the processor trying to acquire it spins until the lock is released.

Is main_hook called from interrupt or bottom-half context? If so, you could use a workqueue to have the "job" (memcpy...) done at lower priority. As a general rule, you should do as little as possible while holding a spinlock.
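The deferral idea can be illustrated with a small userspace model. This is only a sketch of the pattern, not the kernel workqueue API: the names `defer_job`, `run_deferred`, `add_one` and the fixed-size ring are assumptions made for this example. The "hook" side only records what needs doing while holding the lock; the "worker" side later pops each job and runs it with the lock released.

```c
#include <stdatomic.h>
#include <stddef.h>

#define QCAP 8

typedef void (*job_fn)(void *arg);
struct job { job_fn fn; void *arg; };

static atomic_flag qlock = ATOMIC_FLAG_INIT;  /* toy spinlock */
static struct job  queue[QCAP];               /* ring of pending jobs */
static size_t      q_head, q_count;

static void q_lock(void)
{
    while (atomic_flag_test_and_set_explicit(&qlock, memory_order_acquire)) { }
}

static void q_unlock(void)
{
    atomic_flag_clear_explicit(&qlock, memory_order_release);
}

/* Called from the "hook": only records the job, does no real work.
 * Returns 1 if queued, 0 if the ring is full. */
int defer_job(job_fn fn, void *arg)
{
    int ok = 0;

    q_lock();
    if (q_count < QCAP) {
        queue[(q_head + q_count) % QCAP] = (struct job){ fn, arg };
        q_count++;
        ok = 1;
    }
    q_unlock();
    return ok;
}

/* Called later from the "worker": pops jobs one at a time and runs each
 * with the lock released. Returns how many jobs were run. */
size_t run_deferred(void)
{
    size_t ran = 0;

    for (;;) {
        struct job j;

        q_lock();
        if (q_count == 0) {
            q_unlock();
            break;
        }
        j = queue[q_head];
        q_head = (q_head + 1) % QCAP;
        q_count--;
        q_unlock();

        j.fn(j.arg);   /* the heavy part runs outside the lock */
        ran++;
    }
    return ran;
}

/* Example job used for demonstration: increments an int counter. */
void add_one(void *arg)
{
    ++*(int *)arg;
}
```

Queuing two `add_one` jobs leaves the counter unchanged until `run_deferred()` is called, at which point both run and the queue is empty again.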

魂牵梦绕锁你心扉 2024-12-08 07:01:38

My guess is that you have some probably simple programming glitch (e.g. dereferencing NULL), but since spin_lock_irqsave disables interrupts, you hit that exception with interrupts disabled, so the entire machine locks up.

Note that since your NF hook runs in bottom-half context, you don't really need to disable interrupts, just bottom halves. That will make it easier to debug.
