Emacs, Unicode, xterm mouse escape sequences, and wide terminals

Posted on 2024-09-14 12:09:31

Short version: When using emacs' xterm-mouse-mode, Somebody (emacs? bash? xterm?) intercepts xterm's control sequences and replaces them with \0. This is a pain on wide monitors because only the first 223 columns have mouse support.

What is the culprit, and how can I work around it?

From what I can tell this has something to do with Unicode/UTF-8 support, because it wasn't a problem 5-6 years ago when I last had a big monitor.

Gory details follow...

Thanks!

Emacs xterm-mouse-mode has a well-known weakness handling mouse clicks starting around x=95. A workaround, adopted by recent versions of emacs, pushes the problem off to x=223.

Several years ago I figured out that xterm encodes positions in 7-bit octets. Given position 'x' to encode, with X=x-96, send:

\40+x (x < 96)  
\300+X/64 \200+X%64 (otherwise)  

We have to add one to the x position given by emacs, because positions in xterm start at one, not zero. Hence the magic x=95 number pops up because it's coded as "\300\200" -- the first escaped number. Somebody (emacs? bash? xterm?) treats those like "C0" control sequences from ISO 2022. Starting at x=159, we change to "C1" sequences (\301\200), which are also part of ISO 2022.
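
To make the arithmetic concrete, here is a minimal Python sketch that simply transcribes the formula as stated above. It is not a claim about xterm's or emacs' internals, and the names `encode_as_observed` and `as_octal` are made up for illustration.

```python
def encode_as_observed(x_emacs: int) -> list[int]:
    """Byte values seen for a 0-based emacs column, per the formula in the question."""
    x = x_emacs + 1              # xterm positions start at one, not zero
    if x < 96:
        return [0o40 + x]        # single byte: \40 + x
    big_x = x - 96               # X = x - 96
    return [0o300 + big_x // 64, 0o200 + big_x % 64]


def as_octal(seq: list[int]) -> str:
    """Render byte values as backslash-octal escapes, e.g. \\300\\200."""
    return "".join(f"\\{b:o}" for b in seq)


if __name__ == "__main__":
    for col in (0, 94, 95, 159, 223):
        print(col, as_octal(encode_as_observed(col)))
```

Under this formula the three thresholds mentioned above (x=95, x=159, x=223) are exactly the columns where the first byte changes to \300, \301 and \302.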

Trouble hits with \302 sequences, which corresponds to the current x=223 limit. Several years ago I was able to extend the hack to intercept \302 and \303 sequences manually, which got past the problem. Fast forward a few years, and today I find that I'm stuck back at x=223 because Somebody is replacing those sequences with \0.

So, where I'd expect clicking at line 1, col 250 to produce

ESC [ M SPC \303\207 ! ESC [ M # \303\207 !

Instead emacs reports (for any col > 223)

ESC [ M SPC C-@ ! ESC [ M # C-@ !

I suspect that Unicode/UTF-8 support is the culprit. Some digging shows that the Unicode standard allowed C0 and C1 sequences as part of UTF-8 until Nov 2000, and I guess Somebody didn't get the memo (fortunately). However, \302\200 - \302\237 are Unicode control sequences, so Somebody slurps them up (doing who-knows-what with them!) and returns \0 instead.

Some more detailed questions:
- Who is this Somebody that intercepts the codes before they reach emacs' lossage buffer?
- If it's really just about control sequences, how come characters after \302\237, which are UTF-8 encodings of printable Unicode, also come back as \0 ?
- What makes emacs decide whether to display lossage as unicode characters or octal escape sequences, and why don't the two match? For example, my self-built cygwin emacs 23.2.1 (xterm 229) reports \301\202 for column 161, but my rhel5.5-supplied emacs 22.3.1 (xterm 215) reports "Â" (latin A with circumflex), which is actually \303\202 in UTF-8!

Update:

Here's a patch against xterm-261 which makes it emit mouse positions in utf-8 format:

diff -r button.c button.utf-8-fix.c
--- a/button.c  Sat Aug 14 08:23:00 2010 +0200
+++ b/button.c  Thu Aug 26 16:16:48 2010 +0200
@@ -3994,1 +3994,27 @@
-#define MOUSE_LIMIT (255 - 32)
+#define MOUSE_LIMIT (2047 - 32)
+#define MOUSE_UTF_8_START (127 - 32)
+
+static unsigned
+EmitMousePosition(Char line[], unsigned count, int value)
+{
+    /* Add pointer position to key sequence
+     * 
+     * Encode large positions as two-byte UTF-8 
+     *
+     * NOTE: historically, it was possible to emit 256, which became
+     * zero by truncation to 8 bits. While this was arguably a bug,
+     * it's also somewhat useful as a past-end marker so we keep it.
+     */
+    if(value == MOUSE_LIMIT) {
+       line[count++] = CharOf(0);
+    }
+    else if(value < MOUSE_UTF_8_START) {
+       line[count++] = CharOf(' ' + value + 1);
+    }
+    else {
+       value += ' ' + 1;
+       line[count++] = CharOf(0xC0 + (value >> 6));
+       line[count++] = CharOf(0x80 + (value & 0x3F));
+    }
+    return count;
+}
@@ -4001,1 +4027,1 @@
-    Char line[6];
+    Char line[9]; /* \e [ > M Pb Pxh Pxl Pyh Pyl */
@@ -4021,2 +4047,0 @@
-    else if (row > MOUSE_LIMIT)
-       row = MOUSE_LIMIT;
@@ -4028,1 +4052,5 @@
-    else if (col > MOUSE_LIMIT)
+
+    /* Limit to representable mouse dimensions */
+    if (row > MOUSE_LIMIT)
+       row = MOUSE_LIMIT;
+    if (col > MOUSE_LIMIT)
@@ -4090,2 +4118,2 @@
-       line[count++] = CharOf(' ' + col + 1);
-       line[count++] = CharOf(' ' + row + 1);
+       count = EmitMousePosition(line, count, col);
+       count = EmitMousePosition(line, count, row);

Hopefully this (or something like it) will appear in a future version of xterm... the patch makes xterm work out of the box with emacs-23 (which assumes utf-8 input) and fixes the existing problems with xt-mouse.el also. To use it with emacs-22 requires a redefinition of the function it uses to decode mouse positions (the new definition works fine with emacs-23 also):

(defadvice xterm-mouse-event-read (around utf-8 compile activate)
  (setq ad-return-value
        (let ((c (read-char)))
          (cond
           ;; mouse clicks outside the encodable range produce 0
           ((= c 0) #x800)
           ;; must convert UTF-8 to unicode ourselves
           ((and (>= c #xC2) (< emacs-major-version 23))
            (logior (lsh (logand c #x1F) 6) (logand (read-char) #x3F)))
           ;; normal case
           (t c)))))

Distribute the defadvice as part of the .emacs on all machines you log into, and patch the xterm on any machines you work from. Voila!
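
For anyone who wants to check the encoding without building xterm, here is a rough Python port of the patched `EmitMousePosition` and of the decoding done by the advice above. It is a sketch only: the Python function names are made up, the clamp that the patch applies in the caller is folded into the encoder, and the decoder takes the two-byte branch unconditionally (the emacs-22 path).

```python
MOUSE_LIMIT = 2047 - 32          # same constants as in the patch
MOUSE_UTF_8_START = 127 - 32


def emit_mouse_position(value: int) -> bytes:
    """Encode a 0-based row or column the way the patched xterm would."""
    value = min(value, MOUSE_LIMIT)          # the patch clamps before calling
    if value == MOUSE_LIMIT:
        return b"\x00"                       # past-end marker
    if value < MOUSE_UTF_8_START:
        return bytes([ord(" ") + value + 1])
    value += ord(" ") + 1
    return bytes([0xC0 + (value >> 6), 0x80 + (value & 0x3F)])


def read_mouse_position(data: bytes) -> int:
    """Decode one position; like the advice, returns the raw code (value + 33)."""
    c = data[0]
    if c == 0:
        return 0x800                         # click outside the encodable range
    if c >= 0xC2:                            # lead byte of a two-byte sequence
        return ((c & 0x1F) << 6) | (data[1] & 0x3F)
    return c


if __name__ == "__main__":
    for v in (0, 94, 95, 250, 2014, 5000):
        enc = emit_mouse_position(v)
        print(v, enc.hex(), read_mouse_position(enc))
```

For every value from 95 up to just below the limit, the two bytes are the ordinary UTF-8 encoding of the code point value + 33, which is why a UTF-8-aware reader sees each position as a single character.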

WARNING: Applications which use xterm's mouse modes but do not treat their input as utf-8 will get confused by this patch because the mouse escape sequences get longer. However, those applications break horribly with the current xterm because mouse positions with x > 95 look like utf-8 codes but aren't. I'd create a new mouse mode for xterm, but certain applications (gnu screen!) filter out unknown escape sequences. Emacs is the only terminal-mouse app I use, so I consider the patch a net win, but YMMV.


Comments (4)

以可爱出名 2024-09-21 12:09:31

xterm-262 adds the patch inlined above; however, this patch is quite broken by design. Rxvt-unicode's developers realized it and added yet another, much better extension to report mouse coordinates.

Right now I'm working on getting widespread support for this. Rxvt-unicode and iTerm2 already support both extensions. I created patches for xterm (to support the urxvt extension), and for gnome-terminal, konsole and putty to support both new extensions. As for the applications, I've added support for the urxvt extension to Midnight Commander.

Please join me in my effort and try to convince more terminal developers and applications to implement these extensions (at least the urxvt one, because the other one can't be properly automatically recognized by applications).

See http://www.midnight-commander.org/ticket/2662 for technical details and further pointers.

冰之心 2024-09-21 12:09:31

OK, figured it out. There are actually two issues.

First, some source diving shows that xterm clips the mouse-enabled region of the window to 223x223 chars, and sends 0x0 for all other positions.

Second, emacs-23 is UTF-8 aware and gets confused by mouse events having x>160 and y>94; in those cases xterm's encoding for x and y looks like a two-byte UTF-8 character (e.g. 0xC2 0x80) and as a result the mouse sequence seems one character short.
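
Both issues can be reproduced on paper with a small Python sketch of the unpatched report format. The assumptions: the button byte is 32 + button (consistent with the SPC and # bytes in the question), positions are 0-based going in, and `legacy_report` is a made-up name rather than anything in xterm.

```python
LEGACY_LIMIT = 255 - 32          # MOUSE_LIMIT in the unpatched button.c


def legacy_report(button: int, col: int, row: int) -> bytes:
    """Build ESC [ M Cb Cx Cy the way the unpatched xterm does (sketch)."""
    def pos(v: int) -> int:
        # Clamp to the limit, add ' ' + 1, then truncate to 8 bits as CharOf does.
        return (ord(" ") + min(v, LEGACY_LIMIT) + 1) & 0xFF
    return b"\x1b[M" + bytes([ord(" ") + button, pos(col), pos(row)])


if __name__ == "__main__":
    # Issue 1: anything at or past column 223 is clamped and wraps around to 0x00.
    print(legacy_report(0, 250, 0))

    # Issue 2: x=161, y=95 gives the bytes 0xC2 0x80, which a UTF-8 decoder
    # reads as ONE character, so the report appears one character short.
    report = legacy_report(0, 161, 95)
    print(len(report), len(report.decode("utf-8")))
```

The first report ends in the bytes 0x00 and '!', matching the C-@ ! seen in the lossage buffer; the second is six bytes long but decodes to only five characters.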

I'm working on a patch for xterm to make mouse events emit UTF-8 (which would both unconfuse emacs-23 and allow terminals up to 2047x2047), but I'm not sure yet how it will turn out.

说好的呢 2024-09-21 12:09:31

I think the problem that caused your workaround (and the upstream fix that was included in one of the v22 releases) to stop working in 23.2 is within Emacs itself. 23.1 can handle mouse clicks after column 95 using urxvt, gnu screen, putty or iTerm, but 23.2 can't. Setting everything to latin-1 makes no difference. 23.1 has the same code in xt-mouse.el. src/lread.c and src/character.h changed, however, and at a glance I'd guess the bug is in there somewhere. As to what happens after column 223, I've got no clue.

For the benefit of anyone else who's annoyed by the xt-mouse regression in 23.2 here's a modified version of xterm-mouse-event-read that works with mouse clicks up to col 222 (credit to Ryan for the >222 overflow handling which my original fix lacked). This probably won't work in 23.1 or before.

(defun xterm-mouse-event-read ()
  (let ((c (read-char)))
    (cond ((= c 0) #x100)
          ;; for positions past col 222 emacs just delivers
          ;; 0x0, best we can do is stay at eol
          ((= 0 (logand c (- #x100))) c)
          ((logand c #xff)))))

... Edit:
Here's the version from Emacs 24 (bzr head). It works again in 23.2 up to col 222, but lacks the >222 overflow eol handling Ryan suggested:

(defun xterm-mouse-event-read ()
  (let ((c (read-char)))
    (if (> c #x3FFF80)
        (+ 128 (- c #x3FFF80))
      c)))
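
What the Emacs 24 version does is easier to see in a small Python sketch. The assumption here is that Emacs 23 and later hand raw 8-bit bytes 0x80..0xFF to Lisp as character codes 0x3FFF80..0x3FFFFF; the comparison is copied as-is from the Lisp above, strict `>` included.

```python
RAW_BYTE_BASE = 0x3FFF80   # assumed char code Emacs uses for the raw byte 0x80


def xterm_mouse_event_read(c: int) -> int:
    """Map an Emacs 'raw byte' character code back to its byte value."""
    if c > RAW_BYTE_BASE:                  # strict '>' exactly as in the Lisp
        return 128 + (c - RAW_BYTE_BASE)
    return c


if __name__ == "__main__":
    for c in (0x21, 0x7F, 0x3FFF81, 0x3FFFFF):
        print(hex(c), xterm_mouse_event_read(c))
```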

如梦 2024-09-21 12:09:31

While xterm now works in utf-8 mode with a patch, this utf-8 hack will break in the worst possible way in any other locale, as the unicode characters will just be dropped unless representable.

rxvt-unicode has (in releases after 9.09) a 1015 mode that sends replies of the form "ESC [ code ; x ; y M", using decimal numbers. This has the advantage of not needing any probing from apps and also working in non-utf-8 locales.
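
As an illustration of why the decimal form is easy on applications, here is a minimal Python sketch that parses a reply of exactly the shape quoted above. The function name is made up, and the meaning of the individual fields is left to the rxvt-unicode documentation.

```python
import re

# ESC [ code ; x ; y M, all three fields as plain decimal numbers
URXVT_REPORT = re.compile(rb"\x1b\[(\d+);(\d+);(\d+)M")


def parse_urxvt_report(data: bytes):
    """Return (code, x, y) for a 1015-style reply, or None if it doesn't match."""
    m = URXVT_REPORT.fullmatch(data)
    if m is None:
        return None
    code, x, y = (int(g) for g in m.groups())
    return code, x, y


if __name__ == "__main__":
    print(parse_urxvt_report(b"\x1b[32;250;1M"))
    print(parse_urxvt_report(b"\x1b[M !!"))
```

Because every byte of the reply is plain ASCII, it survives any locale unchanged, and there is no 223 or 2047 ceiling on the coordinates.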
