Using awk to remove the byte order mark
What would an `awk` script (presumably a one-liner) for removing a BOM look like?
Specification:
- print every line after the first (`NR > 1`)
- for the first line: if it starts with `#FE #FF` or `#FF #FE`, remove those and print the rest
Using GNU `sed` (on Linux or Cygwin) or `sed` on FreeBSD, the BOM can be removed with a substitution that relies on hexadecimal escapes such as `\xef`.
Advantage of using GNU or FreeBSD `sed`: the `-i` parameter means "in place" and updates files without the need for redirections or weird tricks.
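The answer's own commands are not preserved here, so the following is only a sketch of what an in-place removal with GNU `sed` might look like. It assumes GNU `sed` (for `\xHH` escapes and `-i` without an argument); the file name `bom_demo.txt` is hypothetical.

```shell
# Assumption: GNU sed. bom_demo.txt is a made-up file name for this sketch.
# Create a sample file: UTF-8 BOM (EF BB BF, written as octal) plus two lines.
printf '\357\273\277hello\nworld\n' > bom_demo.txt

# On line 1 only, delete a leading EF BB BF sequence and edit the file in place.
# LC_ALL=C makes sed treat the pattern as plain bytes.
LC_ALL=C sed -i '1s/^\xef\xbb\xbf//' bom_demo.txt

# Dump the first bytes as hex to confirm the file no longer starts with ef bb bf.
head -c 5 bom_demo.txt | od -An -tx1
```

On FreeBSD, `-i` expects a backup suffix as its argument (for example `sed -i .bak ...`), so the invocation would differ slightly there.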
On Mac:
The `awk` solution in another answer works, but the `sed` command above does not. At least on Mac (Sierra), the `sed` documentation does not mention support for hexadecimal escapes à la `\xef`.
A similar in-place update can be achieved with any program by piping its output to the `sponge` tool from moreutils.
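As a rough illustration of the `sponge` idea, the sketch below pipes an `awk` filter into `sponge` so the input file can be overwritten safely. The helper name `strip_bom` and the file name `sponge_demo.txt` are hypothetical, and because `sponge` may not be installed, the sketch falls back to a temporary file.

```shell
# Hypothetical names: strip_bom, sponge_demo.txt.
# Assumption: an awk whose regexes understand \xHH escapes (gawk does).
printf '\357\273\277hello\n' > sponge_demo.txt

strip_bom() {
    # Remove a leading UTF-8 BOM from the first record, then print every record.
    awk 'NR == 1 { sub(/^\xef\xbb\xbf/, "") } { print }' "$1"
}

if command -v sponge >/dev/null 2>&1; then
    # sponge soaks up all of its input before opening the target for writing,
    # which is what makes reading and writing the same file safe.
    strip_bom sponge_demo.txt | sponge sponge_demo.txt
else
    # Without sponge: write to a temporary file, then replace the original.
    strip_bom sponge_demo.txt > sponge_demo.txt.tmp &&
        mv sponge_demo.txt.tmp sponge_demo.txt
fi
```

A plain `awk ... file > file` would not work here, because the shell truncates the file before `awk` gets to read it.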
Try an `awk` program that, on the first record (line), removes the BOM characters and then prints every record.
It can be made slightly shorter using the knowledge that the default action in awk is to print the record: a bare `1` is the shortest condition that always evaluates to true, so each record is printed.
Enjoy!
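The two variants described above could be sketched roughly as follows. This assumes an `awk` whose regular expressions accept `\xHH` escapes (gawk does); the file names are hypothetical.

```shell
# Hypothetical file names; assumption: awk understands \xHH in regexes.
printf '\357\273\277first\nsecond\n' > awk_demo.txt

# Long form: on record 1 strip a leading EF BB BF, then print every record.
awk 'NR == 1 { sub(/^\xef\xbb\xbf/, "") } { print }' awk_demo.txt > awk_long.txt

# Short form: the bare pattern 1 is always true, and the default action
# for a pattern without a body is to print the record.
awk 'NR == 1 { sub(/^\xef\xbb\xbf/, "") } 1' awk_demo.txt > awk_short.txt

# The two outputs should be byte-for-byte identical.
cmp awk_long.txt awk_short.txt && echo "outputs match"
```

Anchoring the pattern with `^` and restricting it to `NR == 1` keeps the substitution from touching the same byte sequence elsewhere in the file.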
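-- ADDENDUM --
The Unicode Byte Order Mark (BOM) FAQ includes the following table listing the exact BOM bytes for each encoding:

Bytes          Encoding form
00 00 FE FF    UTF-32, big-endian
FF FE 00 00    UTF-32, little-endian
FE FF          UTF-16, big-endian
FF FE          UTF-16, little-endian
EF BB BF       UTF-8

Thus, you can see how `\xef\xbb\xbf` corresponds to the `EF BB BF` UTF-8 BOM bytes in the table above.
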
Not awk, but simpler:
To check for BOM:
If a BOM is present, you'll see:
00000000 ef bb bf ...
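The commands for this answer are not shown here, so the following is only one possible sketch: it checks the first three bytes with `od` and, if they are `ef bb bf`, skips them with `tail`. The helper `has_utf8_bom` and the file names are hypothetical; `hexdump -C` on the first three bytes would give output in the `00000000 ef bb bf` form quoted above, where that tool is available.

```shell
# Hypothetical helper and file names, for illustration only.
has_utf8_bom() {
    # Render the first three bytes as hex and compare them with EF BB BF.
    [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

printf '\357\273\277data\n' > check_demo.txt

if has_utf8_bom check_demo.txt; then
    # tail -c +4 starts output at the fourth byte, skipping the 3-byte BOM.
    tail -c +4 check_demo.txt > check_demo.nobom.txt
else
    # No BOM: keep the content unchanged.
    cp check_demo.txt check_demo.nobom.txt
fi
```

Checking first matters because `tail -c +4` drops three bytes unconditionally and would otherwise eat real content from a file that has no BOM.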

In addition to converting CRLF line endings to LF, `dos2unix` also removes BOMs.
`dos2unix` also converts UTF-16 files with a BOM (but not UTF-16 files without a BOM) to UTF-8 without a BOM.
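A minimal sketch of the first case (a UTF-8 file with a BOM and CRLF line endings) might look like this. It assumes a `dos2unix` version that removes the BOM by default when converting to Unix line endings; the call is guarded because the tool may not be installed, and `d2u_demo.txt` is a hypothetical file name.

```shell
# Hypothetical file name; dos2unix may be absent, so the call is guarded.
printf '\357\273\277line one\r\nline two\r\n' > d2u_demo.txt

if command -v dos2unix >/dev/null 2>&1; then
    # Converts in place: CRLF becomes LF and the leading BOM is dropped.
    dos2unix d2u_demo.txt
    # Inspect the first bytes to confirm there is no BOM and no \r left.
    od -c d2u_demo.txt | head -n 2
else
    echo "dos2unix not found; skipping" >&2
fi
```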

I know the question was directed at Unix/Linux, but I thought it would be worth mentioning a good option for the Unix-challenged (on Windows, with a UI).
I ran into the same issue on a WordPress project (the BOM was causing problems with the RSS feed and page validation), and I had to look through all the files in quite a big directory tree to find the one with the BOM. I found an application called Replace Pioneer, and in it:
Batch Runner -> Search (to find all the files in the subfolders) -> Replace Template -> Binary remove BOM (there is a ready-made search and replace template for this).
It was not the most elegant solution, and it did require installing a program, which is a downside. But once I found my way around it, it worked like a charm (and found 3 files with a BOM out of about 2300).