awk: length of åäö umlaut characters is 2
I'm using awk (Mac OS X) to print only lines that are n characters or longer.
If I try it on a text file (strings.txt) that looks like this:
four
foo
bar
föö
bår
fo
ba
fö
bå
And I run this awk script:
awk ' { if( length($0) >= 3 ) print $0 } ' <strings.txt
The output is:
four
foo
bar
föö
bår
fö
bå
(The last two lines should not have been printed.) It seems that each umlaut character (å, ä, ö...) counts as two characters.
(The input file is saved in UTF-8 format.)
BSD awk (a.k.a. BWK awk), as preinstalled on macOS (still true as of macOS 10.13), is, sadly, NOT Unicode-aware.
Your choices are:
- IF you know that the characters involved fit into a single-byte encoding such as ISO-8859-1, you can use iconv to convert the input to that encoding before awk processes it.
- Install a different awk implementation that is Unicode-aware, such as gawk (GNU Awk) or mawk (verify that the version you install counts characters rather than bytes); e.g., via Homebrew: brew info gawk, brew info mawk
- Use a different preinstalled tool that is Unicode-aware, such as sed.
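The commands for these options are not shown here, so the following is only a sketch of how they might look. The function names are hypothetical, the minimum length of 3 comes from the question, and the iconv variant assumes every character in the input exists in ISO-8859-1 (iconv reports an error otherwise).

```sh
# Option 1 (assumption: all characters fit into ISO-8859-1).
# Convert UTF-8 to a single-byte encoding so that one character is one byte,
# let awk count bytes, then convert the surviving lines back to UTF-8.
filter_min_chars_latin1() {
  iconv -f UTF-8 -t ISO-8859-1 |
    LC_ALL=C awk -v n="${1:-3}" 'length($0) >= n' |
    iconv -f ISO-8859-1 -t UTF-8
}

# Option 2: after installing a Unicode-aware awk, the original one-liner
# can be used unchanged, for example:
#   gawk 'length($0) >= 3' strings.txt

# Option 3: sed, assuming a UTF-8 locale is active so that "." matches
# one character rather than one byte.
filter_min_chars_sed() {
  sed -n '/^.\{3,\}/p'
}
```

For example, `filter_min_chars_latin1 3 < strings.txt` should drop fo, ba, fö and bå while keeping föö and bår.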
Try setting your locale:
Change en_US.UTF-8 to your correct locale.
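The command for this step is not shown. One common way to set the locale for a single awk run is sketched below; the function name is hypothetical, and whether it changes the result depends on the awk implementation honoring the locale (an awk that is not Unicode-aware, as described in the answer above, would still count bytes).

```sh
# Run the length filter with an explicit locale for this one awk process.
# $1 = minimum length (default 3), $2 = locale name (default en_US.UTF-8).
# Assumption: the named locale is installed; `locale -a` lists the available ones.
filter_with_locale() {
  LC_ALL="${2:-en_US.UTF-8}" awk -v n="${1:-3}" 'length($0) >= n'
}
```

Usage: `filter_with_locale 3 en_US.UTF-8 < strings.txt`. To apply the locale to the whole shell session instead, `export LC_ALL=en_US.UTF-8` before running awk.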
If you're absolutely sure that your input is already 100% well-formed UTF-8 text, you can simply count the length with a short snippet in a non-Unicode-aware version of awk, without worrying about locale settings. As long as you're in gawk byte mode, or using any of the non-Unicode-aware variants, this works just fine.
This properly counts any single code point specified in Unicode 13.
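The snippet itself is not shown, so what follows is not this answer's code, only a minimal sketch of one way to count code points in byte mode. It relies on the fact that in well-formed UTF-8 every code point has exactly one byte that is not a continuation byte (continuation bytes are 0x80 to 0xBF, octal 200 to 277). The function names are hypothetical, and LC_ALL=C is set as an assumption to force byte semantics.

```sh
# Print the number of UTF-8 code points on each input line.
count_utf8_chars() {
  LC_ALL=C awk '{
    s = $0
    gsub(/[\200-\277]/, "", s)   # delete continuation bytes (10xxxxxx)
    print length(s)              # remaining bytes: one per code point
  }'
}

# Print only lines with at least $1 code points (default 3).
filter_min_utf8_chars() {
  LC_ALL=C awk -v n="${1:-3}" '{
    s = $0
    gsub(/[\200-\277]/, "", s)
    if (length(s) >= n) print    # print the original, unmodified line
  }'
}
```

With the strings.txt from the question, `filter_min_utf8_chars 3 < strings.txt` keeps föö and bår and drops fö and bå. Note that this counts code points, so a base letter followed by a combining mark counts as two.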
Performance-wise, it beats the GNU wc binary (gwc) rather handily: about 67% faster on multi-byte-heavy input, and about 134% faster on multi-byte-lighter input:
time pvE0 < "${m3t}" | mawk2 'BEGIN { FS = "^$" …. }
rows = 12494275. | UTF8 chars = 1285316715. | bytes = 1983544693.
pvE 0.1 in0 < "${m3t}" 0.07s user 0.78s system 5% cpu 16.575 total
time pvE0 < "${m3t}" | gwc -lcm
in0: 1.85GiB 0:00:27 [68.0MiB/s] [68.0MiB/s] [============================>] 100%
12494275 1285316715 1983544693
pvE 0.1 in0 < "${m3t}" 0.07s user 0.80s system 3% cpu 27.838 total
rows = 5983333. | UTF8 chars = 969069988. | bytes = 1036334374.
pvE 0.1 in0 < "${m3s}" 0.04s user 0.60s system 20% cpu 3.177 total
time pvE0 < "${m3s}" | gwc -lcm
in0: 988MiB 0:00:07 [ 135MiB/s] [ 135MiB/s] [============================>] 100%
5983333 969069988 1036334374
pvE 0.1 in0 < "${m3s}" 0.04s user 0.39s system 5% cpu 7.318 total
Try this:
Output: