Stripping hex bytes with sed - no match
I have a text file with two non-ASCII bytes (0xFF and 0xFE):
??58832520.3,ABC
348384,DEF
The hex for this file is:
FF FE 35 38 38 33 32 35 32 30 2E 33 2C 41 42 43 0A 33 34 38 33 38 34 2C 44 45 46
It's coincidental that FF and FE happen to be the leading bytes (they exist throughout my file, although seemingly always at the beginning of a line).
I am trying to strip these bytes out with sed, but nothing I do seems to match them.
$ sed 's/[^a-zA-Z0-9\,]//g' test.csv
??588325203,ABC
348384,DEF
$ sed 's/[a-zA-Z0-9\,]//g' test.csv
??.
Main question: How do I strip these bytes?
Bonus question: The two regexes above are direct negations, so one of them logically has to filter out these bytes, right? Why do neither of these regexes match the 0xFF and 0xFE bytes?
Update: the direct approach of stripping out a range of hex bytes (suggested by two answers below) seems to strip out the first "legit" byte from each line and leave the bytes I'm trying to get rid of:
$ sed 's/[\x80-\xff]//' test.csv
??8832520.3,ABC
48384,DEF
FF FE 38 38 33 32 35 32 30 2E 33 2C 41 42 43 0A 34 38 33 38 34 2C 44 45 46 0A
Notice the missing "5" and "3" from the beginning of each line, and the new 0A added to the end of the file.
Bigger update: This problem seems to be system-specific. It was observed on OS X, but the suggestions (including my original sed statement above) work as I expect them to on NetBSD.
A solution: This same task seems easy enough via Perl:
$ perl -pe 's/^\xFF\xFE//' test.csv
58832520.3,ABC
348384,DEF
However, I'll leave this question open since this is only a workaround, and doesn't explain what the problem was with sed.
Or, as the other answer implies:
See section 3.9 of the sed info pages, the chapter entitled "Escapes".
Edit: for OS X, the native LANG setting is en_US.UTF-8.
Try:
This works on an OS X machine here. I'm not entirely sure why it does not work when in UTF-8.
This will strip out all lines that begin with the specific bytes FF FE.
The reason that your negated regexes aren't working is that [] specifies a character class, and sed is assuming a particular character set, probably ASCII. The bytes in your file aren't 7-bit ASCII characters (both start with the hex digit F, so the high bit is set), and sed doesn't know how to deal with them. The solution above doesn't use character classes, so it should be more portable between platforms and character sets.
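As a rough sketch of the "no character class" idea, the two bytes can be passed to sed literally instead of through `\x` escapes, which not every sed supports. The file names `test.csv` and `clean.csv`, the variable `bom`, and the use of `LC_ALL=C` are assumptions made for this illustration, not part of the original answer:

```shell
# Recreate the sample from the question (with a trailing newline added).
printf '\377\37658832520.3,ABC\n348384,DEF\n' > test.csv

# Put the two raw bytes into a variable so no \x escape support is needed.
bom=$(printf '\377\376')

# LC_ALL=C asks sed to treat every byte as one character.
LC_ALL=C sed "s/^${bom}//" test.csv > clean.csv

# Inspect the first two bytes of the result.
od -An -tx1 -N2 clean.csv
```

If the bytes can appear at the start of any line, the same expression handles that as well, since `^` anchors per line.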
The FF and FE bytes at the beginning of your file are what is called a "byte order mark (BOM)". It can appear at the start of Unicode text streams to indicate the endianness of the text. FF FE indicates UTF-16 in little endian.
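One way to check whether a file really starts with these two bytes is to dump its first two bytes as hex. This is only a sketch: `sample.csv` and the variable `first` are names made up for the illustration.

```shell
# Hypothetical sample file that starts with FF FE.
printf '\377\37658832520.3,ABC\n' > sample.csv

# Read the first two bytes as hex and drop od's spacing.
first=$(od -An -tx1 -N2 sample.csv | tr -d ' \n')

if [ "$first" = "fffe" ]; then
    echo "starts with FF FE"
else
    echo "no FF FE prefix"
fi
```

Note that a leading FF FE only suggests UTF-16LE. If the rest of the file is plain single-byte ASCII, as in the question, the bytes may simply be stray data.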
To show that this isn't an issue of the Unicode BOM, but an issue of eight-bit versus seven-bit characters and tied to the locale, try this:
Show all the bytes:
Have sed remove characters that aren't alphanumeric in the user's locale. Notice that the space and 0x7f are removed:
Have sed remove characters that aren't alphanumeric in the C locale. Notice that only "123abc" remains:
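The C-locale case can be sketched as follows. The exact test bytes here (space, 0x7F, 0x80, 0xFE, 0xFF, then "123abc") and the variable `out` are assumptions chosen for the illustration, written as octal escapes so that `printf` behaves the same across shells:

```shell
# Bytes: space, 0x7F, 0x80, 0xFE, 0xFF, followed by "123abc".
out=$(printf ' \177\200\376\377123abc\n' | LC_ALL=C sed 's/[^[:alnum:]]//g')

echo "$out"   # prints 123abc
```

In the C locale every byte counts as one character, so the negated class matches the high bytes too. In a UTF-8 locale, 0xFE and 0xFF are not valid characters at all, which is a likely reason neither a class nor its negation matches them there.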
On OS X, the byte order mark is probably being read as a single word. Try either sed 's/^\xfffe//g' or sed 's/^\xfeff//g', depending on endianness.
You can match the bytes by their hex codes, \xff and \xfe, and replace them with nothing.
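Replacing the bytes with nothing amounts to deleting them. A sketch of that using `tr` rather than sed (a different tool from the one suggested here, swapped in because its octal escapes are standard) might look like this. The file names `in.csv` and `out.csv` are made up for the illustration:

```shell
# Hypothetical input with FF FE at the start of both lines.
printf '\377\37658832520.3,ABC\n\377\376348384,DEF\n' > in.csv

# Delete every 0xFF and 0xFE byte wherever it occurs; LC_ALL=C keeps tr byte-oriented.
LC_ALL=C tr -d '\377\376' < in.csv > out.csv

cat out.csv
```

Unlike the anchored sed expressions, this removes the two bytes anywhere in the file, so it is only appropriate when they never occur as legitimate data.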
As an alternative, you may use ed(1):