Removing a byte order mark with awk

Posted 2024-10-30 04:44:38

What would an awk script (presumably a one-liner) for removing a BOM look like?

Specification:

  • print every line after the first (NR > 1)
  • for the first line: If it starts with #FE #FF or #FF #FE, remove those and print the rest

Comments (5)

烟花易冷人易散 2024-11-06 04:44:38

Using GNU sed (on Linux or Cygwin):

# Removing BOM from all text files in current directory:
sed -i '1 s/^\xef\xbb\xbf//' *.txt

On FreeBSD:

sed -i .bak '1 s/^\xef\xbb\xbf//' *.txt

Advantage of using GNU or FreeBSD sed: the -i parameter means "in place", and will update files without the need for redirections or weird tricks.

On Mac:

The awk solution in another answer works on Mac, but the sed command above does not. At least on Mac (Sierra), the sed documentation does not mention support for hexadecimal escapes à la \xef.

A similar trick can be achieved with any program by piping to the sponge tool from moreutils:

awk '…' INFILE | sponge INFILE
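
Where moreutils is not installed, the same effect can be approximated with a temporary file. The following is only a sketch: `rewrite_in_place` is a hypothetical helper name, and it assumes `mktemp` is available (it is common but not guaranteed everywhere).

```shell
# Hypothetical helper (not part of any package): run a filter on FILE
# and write the result back to FILE, roughly what sponge does.
# Usage: rewrite_in_place FILE COMMAND [ARGS...]
# Note: plain sh has no local variables, so file/tmp/status are global.
rewrite_in_place() {
    file=$1
    shift
    tmp=$(mktemp) || return 1
    # The filter reads FILE on stdin and writes to the temporary file.
    "$@" < "$file" > "$tmp"
    status=$?
    if [ "$status" -eq 0 ]; then
        # Overwrite the contents rather than moving the temp file,
        # so the original file keeps its permissions.
        cat "$tmp" > "$file"
    fi
    rm -f "$tmp"
    return "$status"
}
```

The filter is passed as ordinary arguments, for example `rewrite_in_place INFILE tr -d '\r'`; the file is only overwritten if the filter exits successfully.
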
卸妝后依然美 2024-11-06 04:44:38

Try this:

awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}{print}' INFILE > OUTFILE

On the first record (line), remove the BOM characters. Print every record.

Or slightly shorter, using the knowledge that the default action in awk is to print the record:

awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}1' INFILE > OUTFILE

1 is the shortest condition that always evaluates to true, so each record is printed.

Enjoy!

-- ADDENDUM --

Unicode Byte Order Mark (BOM) FAQ includes the following table listing the exact BOM bytes for each encoding:

Bytes         |  Encoding Form
--------------------------------------
00 00 FE FF   |  UTF-32, big-endian
FF FE 00 00   |  UTF-32, little-endian
FE FF         |  UTF-16, big-endian
FF FE         |  UTF-16, little-endian
EF BB BF      |  UTF-8

Thus, you can see how \xef\xbb\xbf corresponds to the EF BB BF bytes of the UTF-8 BOM in the table above.
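
A variant of the one-liner above, offered as a sketch rather than a drop-in replacement: it writes the three BOM bytes as octal escapes (\357\273\277), which POSIX specifies for awk, whereas \x escapes are not guaranteed in every awk. It also forces the C locale so the bytes are matched as bytes, and uses FNR so that the first line of each input file is checked. The function name `strip_utf8_bom` is made up for illustration.

```shell
# Hypothetical wrapper: print the given files with a leading UTF-8 BOM removed.
# \357\273\277 is EF BB BF written in octal.
strip_utf8_bom() {
    # LC_ALL=C makes awk treat the input as raw bytes.
    # FNR resets for every input file, unlike NR.
    LC_ALL=C awk 'FNR == 1 { sub(/^\357\273\277/, "") } { print }' "$@"
}
```

It is used like the original, for example `strip_utf8_bom INFILE > OUTFILE`. Note that awk terminates every record with a newline on output, so a file whose last line lacks one will gain it.
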

夜访吸血鬼 2024-11-06 04:44:38

Not awk, but simpler:

tail -c +4 UTF8 > UTF8.nobom

To check for BOM:

hd -n 3 UTF8

If BOM is present you'll see: 00000000 ef bb bf ...
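
Note that tail -c +4 drops the first three bytes whether or not they are a BOM. A guarded version could combine the check and the removal, as in the sketch below. The function names are hypothetical, `od` is used instead of `hd` because `hd` is not installed everywhere, and `head -c` is assumed to be available.

```shell
# Hypothetical helpers, for illustration only.

# Succeeds if the file starts with the UTF-8 BOM bytes EF BB BF.
has_utf8_bom() {
    # od prints the first three bytes as hex; tr strips spaces and newlines.
    [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

# Prints the file without its BOM, or unchanged if it has none.
strip_bom_if_present() {
    if has_utf8_bom "$1"; then
        tail -c +4 "$1"
    else
        cat "$1"
    fi
}
```

Usage would be along the lines of `strip_bom_if_present UTF8 > UTF8.nobom`.
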

余生一个溪 2024-11-06 04:44:38

In addition to converting CRLF line endings to LF, dos2unix also removes BOMs:

dos2unix *.txt

dos2unix also converts UTF-16 files with a BOM (but not UTF-16 files without a BOM) to UTF-8 without a BOM:

$ printf '\ufeffä\n'|iconv -f utf-8 -t utf-16be>bom-utf16be
$ printf '\ufeffä\n'|iconv -f utf-8 -t utf-16le>bom-utf16le
$ printf '\ufeffä\n'>bom-utf8
$ printf 'ä\n'|iconv -f utf-8 -t utf-16be>utf16be
$ printf 'ä\n'|iconv -f utf-8 -t utf-16le>utf16le
$ printf 'ä\n'>utf8
$ for f in *;do printf '%11s %s\n' $f $(xxd -p $f);done
bom-utf16be feff00e4000a
bom-utf16le fffee4000a00
   bom-utf8 efbbbfc3a40a
    utf16be 00e4000a
    utf16le e4000a00
       utf8 c3a40a
$ dos2unix -q *
$ for f in *;do printf '%11s %s\n' $f $(xxd -p $f);done
bom-utf16be c3a40a
bom-utf16le c3a40a
   bom-utf8 c3a40a
    utf16be 00e4000a
    utf16le e4000a00
       utf8 c3a40a
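
For the BOM-less UTF-16 files that dos2unix leaves alone, the encoding has to be stated explicitly. A minimal sketch using iconv, which the transcript above already relies on; it assumes you know the file is little-endian UTF-16, and the function name is hypothetical:

```shell
# Hypothetical helper: convert a UTF-16LE file (no BOM) to UTF-8 on stdout.
# Assumption: the caller knows the byte order; nothing here detects it.
utf16le_to_utf8() {
    iconv -f UTF-16LE -t UTF-8 "$1"
}
```

For example, `utf16le_to_utf8 utf16le > utf8-converted` would write a UTF-8 copy; for big-endian input, UTF-16BE would be the source encoding instead.
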
对不⑦ 2024-11-06 04:44:38

I know the question was directed at unix/linux, but I thought it would be worth mentioning a good option for the unix-challenged (on Windows, with a UI).

I ran into the same issue on a WordPress project (a BOM was causing problems with the RSS feed and page validation), and I had to look through all the files in a fairly large directory tree to find the ones with a BOM. I found an application called Replace Pioneer, and in it:

Batch Runner -> Search (to find all the files in the subfolders) -> Replace Template -> Binary remove BOM (there is a ready made search and replace template for this).

It was not the most elegant solution, and it did require installing a program, which is a downside. But once I found my way around it, it worked like a charm (and found 3 files with a BOM out of about 2300).
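
For readers on unix/linux, the search step (finding which files in a directory tree start with a UTF-8 BOM) can be sketched with standard tools. This is an illustration, not what Replace Pioneer does internally; `find_bom_files` is a hypothetical name and `head -c` is assumed to be available.

```shell
# Hypothetical helper: list regular files under DIR (default: current
# directory) whose first three bytes are the UTF-8 BOM, EF BB BF.
find_bom_files() {
    find "${1:-.}" -type f -exec sh -c '
        for f do
            # Hex dump of the first three bytes, with whitespace removed.
            sig=$(head -c 3 "$f" | od -An -tx1 | tr -d " \n")
            if [ "$sig" = "efbbbf" ]; then
                printf "%s\n" "$f"
            fi
        done
    ' sh {} +
}
```

Running `find_bom_files path/to/project` prints one matching path per line; the list can then be fed to any of the removal commands from the other answers.
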
