How to find and delete invisible characters in a text file using Emacs

Posted 2024-12-09 02:15:58

I have a text file named COPYING that was edited on Windows.
It contains Windows-style line breaks:

$ file COPYING 
COPYING: ASCII English text, with CRLF line terminators

I tried to convert it to Unix style using dos2unix. Here is the output:

$ dos2unix COPYING 
dos2unix: Skipping binary file COPYING

I was surprised to find that dos2unix reports it as a binary file. Using another editor (not Emacs), I then found that the file contains a control character. I would like to find all the invisible characters in the file using Emacs.

By googling, I found the following solution, which uses tr:

tr -cd '\11\12\40-\176' < file_name

How can I do the same thing the Emacs way? I tried Hexl mode, which shows the text and its corresponding ASCII values (in hex) in a single buffer, which is great. How do I find the characters whose octal codes fall outside 11-12 and 40-176 (i.e., anything other than tab, newline, space, and the visible characters)? I tried to write a regular expression for that search, but it got quite complicated.
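
For reference, the octal ranges in the tr command above are \11 (tab), \12 (line feed), and \40-\176 (space through ~). Before deleting anything, it can help to see which bytes fall outside that set. The following is only a minimal sketch of that idea: sample.txt and sample.clean.txt are hypothetical file names, the sample content is made up to stand in for COPYING, and carriage return (\15) is added to the "allowed" set for the listing step so that CRLF line endings are not reported as offenders.

```shell
# Create a small sample file (hypothetical name) with CRLF line endings and
# one form-feed control character, standing in for the real COPYING file.
printf 'line one\r\nline\ftwo\r\n' > sample.txt

# Delete every allowed byte (tab, LF, CR, printable ASCII) so that only the
# unexpected bytes remain, then let od print them in a readable escaped form.
offending=$(tr -d '\11\12\15\40-\176' < sample.txt | od -An -c | tr -d ' \n')
printf 'offending bytes: %s\n' "$offending"

# The filter from the question: keep only tab, LF, and printable ASCII.
# Note that this also drops the carriage returns, i.e. it converts CRLF to LF.
tr -cd '\11\12\40-\176' < sample.txt > sample.clean.txt
```

The first tr call is the inverse of the one in the question: -d without -c deletes the listed bytes instead of keeping them, so whatever survives is exactly what the cleanup step would remove (apart from the carriage returns).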

Comments (3)

肩上的翅膀 2024-12-16 02:15:58

To see invisible characters, you can try whitespace-mode. Spaces and tabs are displayed with a symbol in a different face. If the coding system is automatically detected as DOS (shown as (DOS) in the mode line), the carriage returns at the ends of lines are hidden as well. Run revert-buffer-with-coding-system to switch to Unix or binary (e.g. C-x RET r unix) and they will always show up as ^M. The binary coding system will also display any non-ASCII characters as control characters.

千纸鹤带着心事 2024-12-16 02:15:58

Emacs won't hide any character by default. Press Ctrl+Meta+% (or Esc then Ctrl+% if the former is too hard on your fingers), or use M-x replace-regexp RET if you prefer. Then, for the regular expression, enter

[^@-^H^K-^_^?]

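However, where I wrote ^H, type Ctrl+Q then Ctrl+H to enter a literal “control-H” character, and similarly for the others. You can press Ctrl+Q then Ctrl+Space for ^@, and usually Ctrl+Q then Backspace for ^?. Because the first character inside the brackets is then a literal NUL rather than a caret, the class is not negated: it matches the control characters 0-8 and 11-31 plus DEL (127), leaving tab and newline alone. Replace all occurrences of this regular expression with the empty string.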

Since you have the file open in Emacs, you can change its line endings while you're at it. Press C-x RET f (Ctrl+X, Return, F) and enter us-ascii-unix as the new coding system for the file.

初懵 2024-12-16 02:15:58

Check out M-x set-buffer-file-coding-system. From the documentation:

(set-buffer-file-coding-system CODING-SYSTEM &optional FORCE NOMODIFY)

Set the file coding-system of the current buffer to CODING-SYSTEM.
This means that when you save the buffer, it will be converted
according to CODING-SYSTEM. For a list of possible values of
CODING-SYSTEM, use M-x list-coding-systems.

So, to go from DOS to Unix line endings, run M-x set-buffer-file-coding-system RET unix RET and then save the buffer.
