How to find and remove invisible characters in a text file using Emacs
I have a .txt file named COPYING that was edited on Windows. It contains Windows-style line breaks:
$ file COPYING
COPYING: ASCII English text, with CRLF line terminators
I tried to convert it to Unix style using dos2unix. Below is the output:
$ dos2unix COPYING
dos2unix: Skipping binary file COPYING
I was surprised to find that the dos2unix program reports it as a binary file. Then, using another editor (not Emacs), I found that the file contains a control character. I am interested in finding all the invisible characters in the file using Emacs.
By googling, I found the following solution, which uses tr:
tr -cd '\11\12\40-\176' < file_name
How can I do the same in an Emacs way? I tried Hexl mode. It shows the text and the corresponding ASCII values in a single buffer, which is great. How do I find the characters whose ASCII values fall outside octal 11-12 and 40-176 (i.e. anything other than tab, newline, space, and visible characters)? I tried to create a regular expression for that search, but it is quite complicated.
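For reference, here is a minimal sketch of what the tr command above keeps and drops. The numbers are octal escapes: \11 is tab, \12 is newline, and \40-\176 runs from space through tilde; -c complements that set and -d deletes whatever is left. The sample text is hypothetical, not the contents of COPYING.

```shell
# Hypothetical sample: a CRLF-terminated line, then a line with a backspace (octal 010).
sample=$(printf 'one\r\ntwo\010three\n')

# -c complements the set and -d deletes: every byte except tab (\11),
# newline (\12), and space through tilde (\40-\176) is removed.
cleaned=$(printf '%s\n' "$sample" | tr -cd '\11\12\40-\176')

# Prints "one" and "twothree" on separate lines.
printf '%s\n' "$cleaned"
```

Because carriage return (octal 15) is outside the kept set, this filter also turns CRLF line endings into LF as a side effect.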
Comments (3)
To see invisible characters, you can try whitespace-mode. Spaces and tabs will be displayed with a symbol in a different face. If the coding system is automatically detected as dos (showing (DOS) on the mode line), carriage returns at the end of a line will be hidden as well. Run revert-buffer-with-coding-system to switch it to unix or binary (e.g. C-x RET r unix RET) and they will always show up as ^M. The binary coding system will display any non-ASCII characters as control characters as well.
Emacs won't hide any character by default. Press Ctrl+Meta+%, or Esc then Ctrl+% if the former is too hard on your fingers, or M-x replace-regexp RET if you prefer. Then enter the regular expression that matches the unwanted characters. Wherever it is written as ^H, type Ctrl+Q then Ctrl+H to enter a “control-H” character literally, and similarly for the others. You can press Ctrl+Q then Ctrl+Space for ^@, and usually Ctrl+Q then Backspace for ^?. Replace all occurrences of this regular expression with the empty string.
Since you have the file open in Emacs, you can change its line endings while you're at it. Press C-x RET f (Ctrl+X Return F) and enter us-ascii-unix as the new desired encoding for the file.
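To see which lines contain such characters before replacing anything, the same idea (a character class made of literal control bytes) can be sketched on the command line. This is an illustration under assumptions, not the expression from the answer: the class below is my own choice and deliberately leaves out tab, newline, and carriage return so that CRLF endings are not flagged, and the sample file is hypothetical.

```shell
# Hypothetical CRLF sample file; the real COPYING file is not reproduced here.
tmp=$(mktemp)
printf 'clean line\r\nhas\010backspace\r\nhas\ttab only\r\n' > "$tmp"

# Bracket expression built from literal control bytes: octal 001-010,
# 013, 014, 016-037 and 177. Tab (011), newline (012) and CR (015) are excluded.
pattern=$(printf '[\001-\010\013\014\016-\037\177]')

# LC_ALL=C makes grep compare single bytes; -n prefixes each match with its line number.
hits=$(LC_ALL=C grep -n "$pattern" "$tmp" | cut -d: -f1)

# Prints: lines with control characters: 2
printf 'lines with control characters: %s\n' "$hits"
rm -f "$tmp"
```

NUL bytes cannot be carried in a shell variable, so this sketch does not look for ^@.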
Check out M-x set-buffer-file-coding-system. Going from DOS to Unix line endings, run M-x set-buffer-file-coding-system RET unix RET, then save the buffer.
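One way to confirm that the line endings really changed is to count the lines that still end in a carriage return. The sketch below uses a hypothetical sample file, and tr -d '\r' only stands in for saving the buffer from Emacs with the unix coding system.

```shell
# A literal carriage return, used to build the pattern "CR at end of line".
cr=$(printf '\r')

# Hypothetical sample file with CRLF line endings.
tmp=$(mktemp)
printf 'first\r\nsecond\r\n' > "$tmp"

# Count lines ending in CR before the conversion.
before=$(LC_ALL=C grep -c "$cr\$" "$tmp")

# Stand-in for the Emacs save step: drop every carriage return.
tr -d '\r' < "$tmp" > "$tmp.unix"

# grep -c exits non-zero when the count is 0, hence the "|| true".
after=$(LC_ALL=C grep -c "$cr\$" "$tmp.unix" || true)

printf 'CR-terminated lines before: %s, after: %s\n' "$before" "$after"
rm -f "$tmp" "$tmp.unix"
```

A count of 0 afterwards means no CRLF terminators remain, which is also what file should report for the converted file.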