如何解决/绕过似乎与 GIL 相关的锁定问题
两个线程之间的模糊锁定似乎与全局解释器锁或其他一些“幕后锁”有关,我不知道如何继续进行故障排除。任何有关如何消除锁定的提示将不胜感激。
该问题在更大的代码集中重现(不稳定且有些随机)。代码严格来说是Python的。 Python 版本是 2.6.5(在 Linux 上)。经过数小时的故障排除,已将发生锁定时的问题减少到以下情况:
- 程序只有两个正在运行的线程 线程
- 同时调用受单线程保护的方法。RLock
- 线程 1 已通过以下方式获取锁 [加上一些其他锁] acquire()
- 线程 2 已调用 acquire() 并被确认正在等待锁
- 线程 1 能够使用 print() 打印到控制台,但是它被一个简单的非阻塞库调用
锁定调用 #5 是函数 unicode.encode,它应该是非阻塞的。线程 1 中线程锁定位置的以下代码将(按预期)打印“A”和“B”:
print('A')
print('B')
但是,以下代码只会打印“A”并阻塞线程:
print('A')
u'hello'.encode('utf8') # This dummy (non-blocking) call locks up Thread 1
print('B')
这对我来说根本没有意义。两个线程之间不存在逻辑死锁情况。线程 1 被非阻塞库调用阻塞,该库调用不会以任何方式干扰线程 2,线程 2 只是静静地等待获取 RLock。我能想到线程 1 被阻塞的唯一原因是它正在等待 GIL。
有什么想法如何进一步解决此问题,或者有任何机制可以以某种方式控制或操纵 GIL 操作作为解决方法吗?
编辑:针对样本偏差的一些附加信息(并感谢您的回复)。我在获取跟踪时遇到问题,因为该问题似乎对任何可能干扰两个线程之间的计时的事物都非常敏感。然而,仅使用 -f 选项运行 strace,经过几次迭代后我能够获得跟踪。
线程 1 包含这三个调试语句,应在控制台上打印两行“CHECK_IN”和“CHECK_TEST”:
print('CHECK IN')#DEBUG
u'hello'.encode('utf8')
print('CHECK TEST')#DEBUG
这是 strace 的最后一页:
8605 mmap2(NULL, 266240, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb753d000
8605 mmap2(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0xb6d3c000
8605 mprotect(0xb6d3c000, 4096, PROT_NONE) = 0
8605 clone(child_stack=0xb753c494, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb753cbd8, {entry_number:6, base_addr:0xb753cb70, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb753cbd8) = 8606
8606 set_robust_list(0xb753cbe0, 0xc <unfinished ...>
8605 futex(0xa239138, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8606 <... set_robust_list resumed> ) = 0
8606 futex(0xa239138, FUTEX_WAKE_PRIVATE, 1) = 1
8605 <... futex resumed> ) = 0
8606 gettimeofday( <unfinished ...>
8605 futex(0xa272398, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8606 <... gettimeofday resumed> {1301528807, 326496}, NULL) = 0
8606 futex(0xa272398, FUTEX_WAKE_PRIVATE, 1) = 1
8605 <... futex resumed> ) = 0
8606 futex(0xa272398, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
8605 gettimeofday( <unfinished ...>
8606 <... futex resumed> ) = 0
8605 <... gettimeofday resumed> {1301528807, 326821}, NULL) = 0
8606 futex(0xa272398, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8605 futex(0xa272398, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
8606 <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
8605 <... futex resumed> ) = 0
8606 gettimeofday( <unfinished ...>
8605 futex(0xa272398, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8606 <... gettimeofday resumed> {1301528807, 326908}, NULL) = 0
8606 futex(0xa272398, FUTEX_WAKE_PRIVATE, 1) = 1
8605 <... futex resumed> ) = 0
8606 futex(0xa272398, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
8605 futex(0xa1b0d70, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8606 <... futex resumed> ) = 0
8606 stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2225, ...}) = 0
8606 fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 1), ...}) = 0
8606 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb6d3b000
8606 write(1, "CHECK IN\n", 9) = 9
8606 futex(0xa115270, FUTEX_WAIT_PRIVATE, 0, NULL
在程序锁定之前,这三行代码的输出如下up:
CHECK IN
所以 strace 显示了线程 1 (#8606) 如何写入“CHECK_IN”字符串,以及当到达 unicode.encode 调用时如何进入永远不会返回的等待状态。
顺便说一句,我正在所有模块中进行一些未来的导入,以符合一些较新的 python 约定……
from __future__ import print_function, unicode_literals
但我看不出它们应该有任何区别 - 特别是当 u'hello' 字符串被显式调用为统一码字符串。
An obscure lock-up between two threads seems to be related to the Global Interpreter Lock or some other "behind-the-scenes-lock", and I am at a loss how to continue troubleshooting. Any tips how to eliminate the lock-up would be appreciated.
The issue reproduces (erratically and somewhat random) in a larger set of code. The code is strictly python. Python version is 2.6.5 (on Linux). Hours of troubleshooting has reduced the problem when the lock-up occurs to the following:
- The program has only two running threads
- The threads concurrently call a method protected by a single threading.RLock
- Thread 1 has acquired the lock [plus some other locks] via acquire()
- Thread 2 has called acquire() and is confirmed to be waiting for the lock
- Thread 1 is able to do print to the console with print(), however it is getting locked up by a simple non-blocking library call
The offensive call in #5 is the function unicode.encode, which should be non-blocking. The following code in Thread 1 at the position where the thread locks will (as expected) print 'A' and 'B':
print('A')
print('B')
However, the following will just print 'A' and block the thread:
print('A')
u'hello'.encode('utf8') # This dummy (non-blocking) call locks up Thread 1
print('B')
This makes no sense to me at all. There is no logical dead-lock condition existing between the two threads. Thread 1 is being blocked by a non-blocking library call that is not in any way interfering with Thread 2, which is just silently waiting to acquire an RLock. The only cause I can think of for Thread 1 being blocked is that it is waiting for the GIL.
Any thoughts how to troubleshoot this further, or any mechanisms to somehow control or manipulate GIL operation as a workaround?
Edit: Some additional information in response to samplebias (and thank you for the reply). I had problems getting a trace as the issue seems to be very sensitive to anything that may disturb the timing between the two threads. However running strace with only the -f option, after a few iterations I was able to get a trace.
Thread 1 contains these three debug statements which should print two lines 'CHECK_IN' and 'CHECK_TEST' to the console:
print('CHECK IN')#DEBUG
u'hello'.encode('utf8')
print('CHECK TEST')#DEBUG
Here is the last page of the strace:
8605 mmap2(NULL, 266240, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb753d000
8605 mmap2(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0xb6d3c000
8605 mprotect(0xb6d3c000, 4096, PROT_NONE) = 0
8605 clone(child_stack=0xb753c494, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb753cbd8, {entry_number:6, base_addr:0xb753cb70, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb753cbd8) = 8606
8606 set_robust_list(0xb753cbe0, 0xc <unfinished ...>
8605 futex(0xa239138, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8606 <... set_robust_list resumed> ) = 0
8606 futex(0xa239138, FUTEX_WAKE_PRIVATE, 1) = 1
8605 <... futex resumed> ) = 0
8606 gettimeofday( <unfinished ...>
8605 futex(0xa272398, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8606 <... gettimeofday resumed> {1301528807, 326496}, NULL) = 0
8606 futex(0xa272398, FUTEX_WAKE_PRIVATE, 1) = 1
8605 <... futex resumed> ) = 0
8606 futex(0xa272398, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
8605 gettimeofday( <unfinished ...>
8606 <... futex resumed> ) = 0
8605 <... gettimeofday resumed> {1301528807, 326821}, NULL) = 0
8606 futex(0xa272398, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8605 futex(0xa272398, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
8606 <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
8605 <... futex resumed> ) = 0
8606 gettimeofday( <unfinished ...>
8605 futex(0xa272398, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8606 <... gettimeofday resumed> {1301528807, 326908}, NULL) = 0
8606 futex(0xa272398, FUTEX_WAKE_PRIVATE, 1) = 1
8605 <... futex resumed> ) = 0
8606 futex(0xa272398, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
8605 futex(0xa1b0d70, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8606 <... futex resumed> ) = 0
8606 stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2225, ...}) = 0
8606 fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 1), ...}) = 0
8606 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb6d3b000
8606 write(1, "CHECK IN\n", 9) = 9
8606 futex(0xa115270, FUTEX_WAIT_PRIVATE, 0, NULL
And the output of the three lines of code is just the following, before the program locks up:
CHECK IN
So the strace shows how Thread 1 (#8606) writes the 'CHECK_IN' string, and when reaching the unicode.encode call enters a waiting state which never returns.
By the way I am making a couple future imports in all modules to keep with some newer python conventions ...
from __future__ import print_function, unicode_literals
... but I cannot see they should make any difference - especially as the u'hello' string is called out explicitly as a unicode string.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
我在 Python 源代码中找不到任何会导致
unicode.encode()
阻塞的内容,并且我编写的用于尝试重现此运行的虚拟程序按预期运行。您提到线程 1 已获取超过 1 个锁 - 您是否已消除这些锁作为锁定源?下面的测试用例在您的环境中是否显示出相同的锁定情况?
输出:
您还可以在程序上运行
strace
来找出进程被阻止的位置。例如,使用上面的脚本:-o trace.out
标志告诉 strace 将输出写入文件。您可以省略它,strace 将打印到 stderr。trace.out
的内容应该显示程序进行的所有系统调用,每行都以线程 ID 为前缀以及系统调用之间的相对时间。该行的末尾将包含该系统调用内花费的时间。我用相应的 Python 代码注释了最后几个系统调用:I can find nothing in the Python source which would cause
unicode.encode()
to block, and the dummy program I wrote to try to reproduce this runs as expected. You mentioned that Thread 1 has acquired more than 1 lock - have you eliminated those locks as the source of the lock-up?Does the test case below show the same lock-up in your environment?
Output:
You can also run
strace
on your program to work out where the process is blocked. For example with the above script:The
-o trace.out
flag tells strace to write the output to a file. You can omit that and strace will print to stderr.The contents of
trace.out
should show you all of the system calls made by the program, with each line prefixed with thread id and the relative time between system calls. The end of the line will have the time spent inside that system call. I've annotated the last few syscalls with the corresponding Python code: