如何解决/绕过似乎与 GIL 相关的锁定问题

发布于 2024-10-27 11:44:50 字数 4060 浏览 4 评论 0原文

两个线程之间的模糊锁定似乎与全局解释器锁或其他一些“幕后锁”有关，我不知道如何继续进行故障排除。任何有关如何消除锁定的提示将不胜感激。

该问题在更大的代码集中重现（不稳定且有些随机）。代码严格来说是Python的。 Python 版本是 2.6.5（在 Linux 上）。经过数小时的故障排除，已将发生锁定时的问题减少到以下情况：

程序只有两个正在运行的线程线程
同时调用受单线程保护的方法。RLock
线程 1 已通过以下方式获取锁 [加上一些其他锁] acquire()
线程 2 已调用 acquire() 并被确认正在等待锁
线程 1 能够使用 print() 打印到控制台，但是它被一个简单的非阻塞库调用

锁定调用 #5 是函数 unicode.encode，它应该是非阻塞的。线程 1 中线程锁定位置的以下代码将（按预期）打印“A”和“B”：

print('A')
print('B')

但是，以下代码只会打印“A”并阻塞线程：

print('A')
u'hello'.encode('utf8') # This dummy (non-blocking) call locks up Thread 1
print('B')

这对我来说根本没有意义。两个线程之间不存在逻辑死锁情况。线程 1 被非阻塞库调用阻塞，该库调用不会以任何方式干扰线程 2，线程 2 只是静静地等待获取 RLock。我能想到线程 1 被阻塞的唯一原因是它正在等待 GIL。

有什么想法如何进一步解决此问题，或者有任何机制可以以某种方式控制或操纵 GIL 操作作为解决方法吗？

编辑：针对样本偏差的一些附加信息（并感谢您的回复）。我在获取跟踪时遇到问题，因为该问题似乎对任何可能干扰两个线程之间的计时的事物都非常敏感。然而，仅使用 -f 选项运行 strace，经过几次迭代后我能够获得跟踪。

线程 1 包含这三个调试语句，应在控制台上打印两行“CHECK_IN”和“CHECK_TEST”：

print('CHECK IN')#DEBUG
u'hello'.encode('utf8')
print('CHECK TEST')#DEBUG

这是 strace 的最后一页：

8605  mmap2(NULL, 266240, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb753d000
8605  mmap2(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0xb6d3c000
8605  mprotect(0xb6d3c000, 4096, PROT_NONE) = 0
8605  clone(child_stack=0xb753c494, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb753cbd8, {entry_number:6, base_addr:0xb753cb70, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb753cbd8) = 8606
8606  set_robust_list(0xb753cbe0, 0xc <unfinished ...>
8605  futex(0xa239138, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8606  <... set_robust_list resumed> )   = 0
8606  futex(0xa239138, FUTEX_WAKE_PRIVATE, 1) = 1
8605  <... futex resumed> )             = 0
8606  gettimeofday( <unfinished ...>
8605  futex(0xa272398, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8606  <... gettimeofday resumed> {1301528807, 326496}, NULL) = 0
8606  futex(0xa272398, FUTEX_WAKE_PRIVATE, 1) = 1
8605  <... futex resumed> )             = 0
8606  futex(0xa272398, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
8605  gettimeofday( <unfinished ...>
8606  <... futex resumed> )             = 0
8605  <... gettimeofday resumed> {1301528807, 326821}, NULL) = 0
8606  futex(0xa272398, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8605  futex(0xa272398, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
8606  <... futex resumed> )             = -1 EAGAIN (Resource temporarily unavailable)
8605  <... futex resumed> )             = 0
8606  gettimeofday( <unfinished ...>
8605  futex(0xa272398, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8606  <... gettimeofday resumed> {1301528807, 326908}, NULL) = 0
8606  futex(0xa272398, FUTEX_WAKE_PRIVATE, 1) = 1
8605  <... futex resumed> )             = 0
8606  futex(0xa272398, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
8605  futex(0xa1b0d70, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8606  <... futex resumed> )             = 0
8606  stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2225, ...}) = 0
8606  fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 1), ...}) = 0
8606  mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb6d3b000
8606  write(1, "CHECK IN\n", 9)         = 9
8606  futex(0xa115270, FUTEX_WAIT_PRIVATE, 0, NULL

在程序锁定之前，这三行代码的输出如下up：

CHECK IN

所以 strace 显示了线程 1 (#8606) 如何写入“CHECK_IN”字符串，以及当到达 unicode.encode 调用时如何进入永远不会返回的等待状态。

顺便说一句，我正在所有模块中进行一些未来的导入，以符合一些较新的 python 约定……

from __future__ import print_function, unicode_literals

但我看不出它们应该有任何区别 - 特别是当 u'hello' 字符串被显式调用为统一码字符串。

原文

An obscure lock-up between two threads seems to be related to the Global Interpreter Lock or some other "behind-the-scenes-lock", and I am at a loss how to continue troubleshooting. Any tips how to eliminate the lock-up would be appreciated.

The issue reproduces (erratically and somewhat random) in a larger set of code. The code is strictly python. Python version is 2.6.5 (on Linux). Hours of troubleshooting has reduced the problem when the lock-up occurs to the following:

The program has only two running threads
The threads concurrently call a method protected by a single threading.RLock
Thread 1 has acquired the lock [plus some other locks] via acquire()
Thread 2 has called acquire() and is confirmed to be waiting for the lock
Thread 1 is able to do print to the console with print(), however it is getting locked up by a simple non-blocking library call

The offensive call in #5 is the function unicode.encode, which should be non-blocking. The following code in Thread 1 at the position where the thread locks will (as expected) print 'A' and 'B':

print('A')
print('B')

However, the following will just print 'A' and block the thread:

print('A')
u'hello'.encode('utf8') # This dummy (non-blocking) call locks up Thread 1
print('B')

This makes no sense to me at all. There is no logical dead-lock condition existing between the two threads. Thread 1 is being blocked by a non-blocking library call that is not in any way interfering with Thread 2, which is just silently waiting to acquire an RLock. The only cause I can think of for Thread 1 being blocked is that it is waiting for the GIL.

Any thoughts how to troubleshoot this further, or any mechanisms to somehow control or manipulate GIL operation as a workaround?

Edit: Some additional information in response to samplebias (and thank you for the reply). I had problems getting a trace as the issue seems to be very sensitive to anything that may disturb the timing between the two threads. However running strace with only the -f option, after a few iterations I was able to get a trace.

Thread 1 contains these three debug statements which should print two lines 'CHECK_IN' and 'CHECK_TEST' to the console:

print('CHECK IN')#DEBUG
u'hello'.encode('utf8')
print('CHECK TEST')#DEBUG

Here is the last page of the strace:

8605  mmap2(NULL, 266240, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb753d000
8605  mmap2(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0xb6d3c000
8605  mprotect(0xb6d3c000, 4096, PROT_NONE) = 0
8605  clone(child_stack=0xb753c494, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb753cbd8, {entry_number:6, base_addr:0xb753cb70, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb753cbd8) = 8606
8606  set_robust_list(0xb753cbe0, 0xc <unfinished ...>
8605  futex(0xa239138, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8606  <... set_robust_list resumed> )   = 0
8606  futex(0xa239138, FUTEX_WAKE_PRIVATE, 1) = 1
8605  <... futex resumed> )             = 0
8606  gettimeofday( <unfinished ...>
8605  futex(0xa272398, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8606  <... gettimeofday resumed> {1301528807, 326496}, NULL) = 0
8606  futex(0xa272398, FUTEX_WAKE_PRIVATE, 1) = 1
8605  <... futex resumed> )             = 0
8606  futex(0xa272398, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
8605  gettimeofday( <unfinished ...>
8606  <... futex resumed> )             = 0
8605  <... gettimeofday resumed> {1301528807, 326821}, NULL) = 0
8606  futex(0xa272398, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8605  futex(0xa272398, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
8606  <... futex resumed> )             = -1 EAGAIN (Resource temporarily unavailable)
8605  <... futex resumed> )             = 0
8606  gettimeofday( <unfinished ...>
8605  futex(0xa272398, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8606  <... gettimeofday resumed> {1301528807, 326908}, NULL) = 0
8606  futex(0xa272398, FUTEX_WAKE_PRIVATE, 1) = 1
8605  <... futex resumed> )             = 0
8606  futex(0xa272398, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
8605  futex(0xa1b0d70, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8606  <... futex resumed> )             = 0
8606  stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2225, ...}) = 0
8606  fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 1), ...}) = 0
8606  mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb6d3b000
8606  write(1, "CHECK IN\n", 9)         = 9
8606  futex(0xa115270, FUTEX_WAIT_PRIVATE, 0, NULL

And the output of the three lines of code is just the following, before the program locks up:

CHECK IN

So the strace shows how Thread 1 (#8606) writes the 'CHECK_IN' string, and when reaching the unicode.encode call enters a waiting state which never returns.

By the way I am making a couple future imports in all modules to keep with some newer python conventions ...

from __future__ import print_function, unicode_literals

... but I cannot see they should make any difference - especially as the u'hello' string is called out explicitly as a unicode string.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

挽你眉间 2024-11-03 11:44:50

我在 Python 源代码中找不到任何会导致 unicode.encode() 阻塞的内容，并且我编写的用于尝试重现此运行的虚拟程序按预期运行。您提到线程 1 已获取超过 1 个锁 - 您是否已消除这些锁作为锁定源？

下面的测试用例在您的环境中是否显示出相同的锁定情况？

import time
import threading

def worker(tid):
    _lock.acquire()
    if not tid:
        # wait for rest of threads to enter acquire
        time.sleep(0.5)    
    print('%d: A' % tid)
    u'hello'.encode('utf-8')
    print('%d: B' % tid)
    _lock.release()

def start(tid):
    th = threading.Thread(target=worker, args=(tid,))
    th.start()
    return th

_num = 2
_lock = threading.RLock()
workers = [start(n) for n in range(_num)]
while all(w.isAlive() for w in workers):
    time.sleep(1)

输出：

0: A
0: B
1: A
1: B

您还可以在程序上运行 strace 来找出进程被阻止的位置。例如，使用上面的脚本：

% strace -fTr -o trace.out python lockup.py

-o trace.out 标志告诉 strace 将输出写入文件。您可以省略它，strace 将打印到 stderr。

trace.out 的内容应该显示程序进行的所有系统调用，每行都以线程 ID 为前缀以及系统调用之间的相对时间。该行的末尾将包含该系统调用内花费的时间。我用相应的 Python 代码注释了最后几个系统调用：

# thread 0 time.sleep(0.5) completes
24778      0.500124 <... select resumed> ) = 0 (Timeout) <0.500599>
# preparing to print()
24778      0.000071 fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0 <0.000017>
24778      0.000058 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8fe90a6000 <0.000018>
# print("0: A\n")..
24778      0.000079 write(1, "0: A\n", 5) = 5 <0.000023>
24778      0.000106 write(1, "0: B\n", 5) = 5 <0.000056>
# thread 0 _lock.release()
24778      0.000114 futex(0xe0f3c0, FUTEX_WAKE_PRIVATE, 1) = 1 <0.000024>
24778      0.000108 madvise(0x7f8fe7266000, 8368128, MADV_DONTNEED) = 0 <0.000030>
# thread 0 exit
24778      0.000072 _exit(0)            = ?
# thread 1 _lock.acquire()
24779      0.000050 <... futex resumed> ) = 0 <0.500774>
# thread 1 print("1: A\n") and so on..
24779      0.000052 write(1, "1: A\n", 5) = 5 <0.000026>
24779      0.000086 write(1, "1: B\n", 5) = 5 <0.000026>
24779      0.000099 madvise(0x7f8fe6a65000, 8368128, MADV_DONTNEED) = 0 <0.000024>
24779      0.000064 _exit(0)            = ?
24777      0.499956 <... select resumed> ) = 0 (Timeout) <1.001138>
24777      0.000132 rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7f8fe8c7c8f0}, {0x4d9a90, [], SA_RESTORER, 0x7f8fe8c7c8f0}, 8) = 0 <0.000025>
# main thread process exit
24777      0.002349 exit_group(0)       = ?

I can find nothing in the Python source which would cause unicode.encode() to block, and the dummy program I wrote to try to reproduce this runs as expected. You mentioned that Thread 1 has acquired more than 1 lock - have you eliminated those locks as the source of the lock-up?

Does the test case below show the same lock-up in your environment?

import time
import threading

def worker(tid):
    _lock.acquire()
    if not tid:
        # wait for rest of threads to enter acquire
        time.sleep(0.5)    
    print('%d: A' % tid)
    u'hello'.encode('utf-8')
    print('%d: B' % tid)
    _lock.release()

def start(tid):
    th = threading.Thread(target=worker, args=(tid,))
    th.start()
    return th

_num = 2
_lock = threading.RLock()
workers = [start(n) for n in range(_num)]
while all(w.isAlive() for w in workers):
    time.sleep(1)

Output:

0: A
0: B
1: A
1: B

You can also run strace on your program to work out where the process is blocked. For example with the above script:

% strace -fTr -o trace.out python lockup.py

The -o trace.out flag tells strace to write the output to a file. You can omit that and strace will print to stderr.

The contents of trace.out should show you all of the system calls made by the program, with each line prefixed with thread id and the relative time between system calls. The end of the line will have the time spent inside that system call. I've annotated the last few syscalls with the corresponding Python code:

# thread 0 time.sleep(0.5) completes
24778      0.500124 <... select resumed> ) = 0 (Timeout) <0.500599>
# preparing to print()
24778      0.000071 fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0 <0.000017>
24778      0.000058 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8fe90a6000 <0.000018>
# print("0: A\n")..
24778      0.000079 write(1, "0: A\n", 5) = 5 <0.000023>
24778      0.000106 write(1, "0: B\n", 5) = 5 <0.000056>
# thread 0 _lock.release()
24778      0.000114 futex(0xe0f3c0, FUTEX_WAKE_PRIVATE, 1) = 1 <0.000024>
24778      0.000108 madvise(0x7f8fe7266000, 8368128, MADV_DONTNEED) = 0 <0.000030>
# thread 0 exit
24778      0.000072 _exit(0)            = ?
# thread 1 _lock.acquire()
24779      0.000050 <... futex resumed> ) = 0 <0.500774>
# thread 1 print("1: A\n") and so on..
24779      0.000052 write(1, "1: A\n", 5) = 5 <0.000026>
24779      0.000086 write(1, "1: B\n", 5) = 5 <0.000026>
24779      0.000099 madvise(0x7f8fe6a65000, 8368128, MADV_DONTNEED) = 0 <0.000024>
24779      0.000064 _exit(0)            = ?
24777      0.499956 <... select resumed> ) = 0 (Timeout) <1.001138>
24777      0.000132 rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7f8fe8c7c8f0}, {0x4d9a90, [], SA_RESTORER, 0x7f8fe8c7c8f0}, 8) = 0 <0.000025>
# main thread process exit
24777      0.002349 exit_group(0)       = ?

回复收藏 0 原文

~没有更多了~