awk: umlaut characters (å, ä, ö) have a length of 2

Posted on 2024-12-07 01:48:57


I'm using awk (Mac OS X) to print only lines that are n characters or longer.

If I try it on a text file (strings.txt) that looks like this:

four
foo
bar
föö
bår
fo
ba
fö
bå

And I run this awk script:

awk ' { if( length($0) >= 3 ) print $0 } ' <strings.txt 

The output is:

four
foo
bar
föö
bår
fö
bå

(The last two lines should not have been printed.) It seems that each umlaut character (å, ä, ö, ...) counts as two characters.

(The input file is saved in UTF-8 format.)
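
One way to see where the extra count comes from is to compare the byte count of a word with the number of characters it contains. The sketch below is an illustration rather than part of the original question; the variable names are made up, and the word is written with octal escapes so that it does not depend on the encoding of the script itself.

```shell
# 'fö' written as raw UTF-8 bytes: 'f' is 0x66, 'ö' is 0xC3 0xB6.
word=$(printf 'f\303\266')

# wc -c always counts bytes, regardless of locale.
byte_count=$(printf '%s' "$word" | wc -c | tr -d ' ')

echo "bytes: $byte_count"   # 2 characters, but 3 bytes
```

An awk whose length() counts bytes therefore sees fö as 3 and lets it through the `>= 3` filter.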


Comments (4)

屋檐 2024-12-14 01:48:57


BSD awk (a.k.a. BWK awk), as preinstalled on macOS (still true as of macOS 10.13), is, sadly, not Unicode-aware.

Your choices are:

  • If you know that all characters involved fit into a single-byte encoding such as ISO-8859-1, you can use iconv as follows:

    iconv -f UTF-8 -t ISO-8859-1 file | awk 'length >= 3' | iconv -f ISO-8859-1 -t UTF-8
    
  • Install a different awk implementation that is Unicode-aware, such as gawk (GNU Awk), e.g., via Homebrew. Note that mawk is byte-oriented, so switching to it does not fix the character count:
    • brew info gawk
  • Use a different preinstalled tool that is Unicode-aware, such as sed:

    sed -n '/^.\{3,\}/p' file
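
The iconv round trip from the first option can be tried on a few of the sample words. This is a sketch under the assumption that iconv is installed; the `LC_ALL=C` prefix is an addition that is not in the answer, included so that awk counts bytes even if it happens to be a Unicode-aware implementation.

```shell
# Sample input as raw UTF-8 bytes: foo, fö, föö (one per line).
result=$(printf 'foo\nf\303\266\nf\303\266\303\266\n' |
    iconv -f UTF-8 -t ISO-8859-1 |       # each character becomes exactly one byte
    LC_ALL=C awk 'length($0) >= 3' |     # byte count now equals character count
    iconv -f ISO-8859-1 -t UTF-8)        # convert the surviving lines back

printf '%s\n' "$result"
```

If the input contains a character that ISO-8859-1 cannot represent, the first iconv call fails with a conversion error, which is why this option only applies when the character set is known in advance.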
    
跨年 2024-12-14 01:48:57


Try setting your locale:

LC_ALL=en_US.UTF-8 awk 'length >= 3' infile

Change en_US.UTF-8 to your correct locale.
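
Whether the locale setting helps depends on the awk implementation, so it can be worth probing before relying on it. A minimal sketch, assuming the en_US.UTF-8 locale is installed (the variable name `probe` is only for illustration):

```shell
# Feed awk a single 'ö' (UTF-8 bytes 0xC3 0xB6) and ask for its length.
probe=$(printf '\303\266\n' | LC_ALL=en_US.UTF-8 awk '{ print length($0) }')

case "$probe" in
    1) echo "awk counts characters under this locale" ;;
    2) echo "awk counts bytes (locale missing or not honored)" ;;
    *) echo "unexpected length: $probe" ;;
esac
```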

小苏打饼 2024-12-14 01:48:57


If you're absolutely sure that your input is already 100% "well-formed" UTF-8 text, then you can simply count its length with this short snippet, using non-Unicode-aware versions of awk:

# Run with any one of: mawk, mawk2, nawk, or gawk -b (gawk in byte mode).
gawk -b 'BEGIN { FS = "^$" }
{
    bytes += length($0)

    # If the input is well-formed UTF-8, then once all continuation
    # bytes are removed, only ASCII bytes and the multi-byte leading
    # bytes \xC2-\xF4 remain, i.e. one byte per character.
    gsub(/[\200-\301\365-\377]+/, "")

    chars += length($0)
}
END {
    # NR is added to both totals to count the newline that ends each row.
    # Remove the \047 (the thousands-grouping flag) for mawk2.
    printf("rows       = %\047.f | " \
           "UTF8 chars = %\047.f | " \
           "bytes      = %\047.f\n", \
           NR, NR + chars, NR + bytes)
}'

There is no need to worry about locale settings here: as long as you're in gawk byte mode (-b), or using any of the non-Unicode-aware variants, this works just fine.

This correctly counts any single code point specified in Unicode 13.

  • P.S.: input that contains UTF-16 surrogate pairs encoded as UTF-8 doesn't necessarily constitute "well-formed" UTF-8 per se.
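
The same byte-stripping idea can be applied to the original task of filtering lines by length. The following is a sketch, not the answer's own code: it assumes well-formed UTF-8 input, strips only the continuation bytes 0x80-0xBF from a copy of each line, and forces the C locale so that any awk operates on bytes.

```shell
# Sample input as raw UTF-8 bytes: foo, fö, bår, fo (one per line).
result=$(printf 'foo\nf\303\266\nb\303\245r\nfo\n' |
    LC_ALL=C awk '{
        line = $0
        # Drop continuation bytes; every remaining byte starts one character.
        gsub(/[\200-\277]/, "", line)
        if (length(line) >= 3) print $0
    }')

printf '%s\n' "$result"
```

Working on a copy (`line`) leaves `$0` untouched, so the lines that pass the filter are printed with their original bytes intact.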

Performance-wise, it beats the GNU wc binary rather handily: about 67% faster on multi-byte-heavy input, and about 134% faster on input with fewer multi-byte characters:

first file

time pvE0 < "${m3t}" | mawk2 'BEGIN { FS = "^$" …. }

  in0: 1.85GiB 0:00:16 [ 114MiB/s] [ 114MiB/s] [============================>] 100%            

rows = 12494275. | UTF8 chars = 1285316715. | bytes = 1983544693.

pvE 0.1 in0 < "${m3t}" 0.07s user 0.78s system 5% cpu 16.575 total

time pvE0 < "${m3t}" | gwc -lcm
in0: 1.85GiB 0:00:27 [68.0MiB/s] [68.0MiB/s] [============================>] 100%
12494275 1285316715 1983544693
pvE 0.1 in0 < "${m3t}" 0.07s user 0.80s system 3% cpu 27.838 total

second file

  in0:  988MiB 0:00:03 [ 316MiB/s] [ 316MiB/s] [============================>] 100%            

rows = 5983333. | UTF8 chars = 969069988. | bytes = 1036334374.

pvE 0.1 in0 < "${m3s}" 0.04s user 0.60s system 20% cpu 3.177 total

time pvE0 < "${m3s}" | gwc -lcm
in0: 988MiB 0:00:07 [ 135MiB/s] [ 135MiB/s] [============================>] 100%
5983333 969069988 1036334374
pvE 0.1 in0 < "${m3s}" 0.04s user 0.39s system 5% cpu 7.318 total

故事↓在人 2024-12-14 01:48:57


Try this:

$  echo "four
foo
bar
föö
bår
fo
ba
fö
bå
"|awk ' {x=$0;gsub(/./,"x",x); if( length(x) >= 3 ) print $0 } ' 

Output:

four
foo
bar
föö
bår