awk: length of umlaut characters (å, ä, ö) is 2

Posted 2024-12-07 01:48:57

I'm using awk (on Mac OS X) to print only lines that are n characters or longer.

If I try it on a text file (strings.txt) that looks like this:

four
foo
bar
föö
bår
fo
ba
fö
bå

And I run this awk script:

awk ' { if( length($0) >= 3 ) print $0 } ' <strings.txt

The output is:

four
foo
bar
föö
bår
fö
bå

(The last two lines should not have been printed). It seems like words that contain umlaut-characters (å, ä, ö...) count as two characters.

(The input file is saved in UTF8 format.)
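The observation fits how UTF-8 stores these letters: `ö` (U+00F6) is encoded as the two bytes 0xC3 0xB6, so a tool that counts bytes sees "fö" as three units. A small sketch to confirm this, writing the bytes as octal escapes so the result does not depend on the encoding of the script itself:

```shell
# "fö" spelled out as raw bytes: f = 0x66, ö = 0xC3 0xB6 (octal \303 \266).
bytes=$(printf 'f\303\266' | wc -c)

# $(( ... )) strips the padding that some wc implementations add.
echo "byte count: $((bytes))"

# Hex dump of the same bytes, for inspection.
printf 'f\303\266' | od -An -tx1
```

A byte-oriented `length()` therefore reports 3 for "fö", which is why that line passes the `>= 3` test.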


屋檐 2024-12-14 01:48:57

BSD awk (a.k.a. BWK awk), as preinstalled on macOS (still true as of macOS 10.13), is, sadly, NOT Unicode-aware: its `length` counts bytes, not characters.

Your choices are:

If you know that the characters involved fit into a single-byte encoding such as ISO-8859-1, you can use iconv as follows:
```
iconv -f UTF-8 -t ISO-8859-1 file | awk 'length >= 3' | iconv -f ISO-8859-1 -t UTF-8
```
Install a different awk implementation that is Unicode-aware, such as gawk (GNU Awk); e.g., via Homebrew:
- brew info gawk

(mawk is not a substitute here: it counts bytes as well.)

Use a different preinstalled tool that is Unicode-aware, such as sed:
```
sed -n '/^.\{3,\}/p' file
```
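To find out which kind of awk is on the PATH, one quick check is to ask it for the length of a single `ö`. This is only a sketch: the function name `awk_length_unit` is made up for illustration, and the result also depends on the current locale, since a Unicode-aware awk falls back to counting bytes in the C locale.

```shell
# Hypothetical helper: report whether the default awk counts bytes or characters.
awk_length_unit() {
    # A single "ö" as raw UTF-8 bytes (0xC3 0xB6), followed by a newline.
    n=$(printf '\303\266\n' | awk '{ print length($0) }')
    case "$n" in
        1) echo "characters" ;;   # multibyte-aware in the current locale
        2) echo "bytes" ;;        # byte-oriented, or running in a non-UTF-8 locale
        *) echo "unexpected: $n" ;;
    esac
}

awk_length_unit
```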


跨年 2024-12-14 01:48:57

Try setting your locale:

LC_ALL=en_US.UTF-8 awk 'length >= 3' infile

Change en_US.UTF-8 to your correct locale.


小苏打饼 2024-12-14 01:48:57

If you're absolutely sure that your input is already 100% well-formed UTF-8 text, you can count characters with this short snippet, using an awk that is not Unicode-aware (mawk, mawk2, nawk) or gawk in byte mode (`gawk -b`):

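gawk -b 'BEGIN { FS = "^$" }   # or mawk / mawk2 / nawk without -b; this FS effectively disables field splitting
{
    bytes += length($0)

    # Strip continuation bytes (\200-\277) and bytes that never occur in
    # well-formed UTF-8 (\300, \301, \365-\377). For well-formed input this
    # leaves only ASCII plus one lead byte (\xC2-\xF4) per multi-byte character.
    # (No apostrophes in these comments: the script is single-quoted in the shell.)
    gsub(/[\200-\301\365-\377]+/, "")

    chars += length($0)
}
END {
    # NR is added so that each line-terminating newline is counted too.
    printf("rows       = %\047.f | " \
           "UTF8 chars = %\047.f | " \
           "bytes      = %\047.f\n",
           NR, NR + chars, NR + bytes)
}'  # remove the \047 for mawk2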

There is no need to worry about locale settings with this approach. As long as you're in gawk byte mode, or using any of the non-Unicode-aware variants, it works just fine.

This should correctly count every single code point specified in Unicode 13.

P.S.: input containing UTF-16 surrogate pairs doesn't necessarily constitute "well-formed" UTF-8 per se.
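The same byte-stripping idea can be turned into a per-line filter for the original question. The following is a minimal sketch, assuming well-formed UTF-8 input; the function name `utf8_min_length` is made up for illustration, and `LC_ALL=C` is set so that a Unicode-aware awk also treats the input as bytes.

```shell
# Hypothetical helper: print lines that are at least N characters long,
# counting UTF-8 code points instead of bytes.
utf8_min_length() {
    LC_ALL=C awk -v min="$1" '{
        line = $0
        gsub(/[\200-\277]/, "", line)    # drop UTF-8 continuation bytes
        if (length(line) >= min) print   # print the untouched original line
    }'
}

# Sample input written with octal escapes: four, föö, fo, fö, bå
printf 'four\nf\303\266\303\266\nfo\nf\303\266\nb\303\245\n' | utf8_min_length 3
```

Only `four` and `föö` should pass the filter, because each remaining byte after the `gsub` corresponds to exactly one character. Note that this counts code points, so a letter written with a separate combining mark would still count as two.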

Performance-wise, it beats the GNU wc binary rather handily: about 67% faster on multi-byte-heavy input, and some 134% faster on input with fewer multi-byte characters:

first file

time pvE0 < "${m3t}" | mawk2 'BEGIN { FS = "^$" …. }

  in0: 1.85GiB 0:00:16 [ 114MiB/s] [ 114MiB/s] [============================>] 100%

rows = 12494275. | UTF8 chars = 1285316715. | bytes = 1983544693.

pvE 0.1 in0 < "${m3t}" 0.07s user 0.78s system 5% cpu 16.575 total

time pvE0 < "${m3t}" | gwc -lcm
in0: 1.85GiB 0:00:27 [68.0MiB/s] [68.0MiB/s] [============================>] 100%
12494275 1285316715 1983544693
pvE 0.1 in0 < "${m3t}" 0.07s user 0.80s system 3% cpu 27.838 total

second file

  in0:  988MiB 0:00:03 [ 316MiB/s] [ 316MiB/s] [============================>] 100%

rows = 5983333. | UTF8 chars = 969069988. | bytes = 1036334374.

pvE 0.1 in0 < "${m3s}" 0.04s user 0.60s system 20% cpu 3.177 total

time pvE0 < "${m3s}" | gwc -lcm
in0: 988MiB 0:00:07 [ 135MiB/s] [ 135MiB/s] [============================>] 100%
5983333 969069988 1036334374
pvE 0.1 in0 < "${m3s}" 0.04s user 0.39s system 5% cpu 7.318 total


故事↓在人 2024-12-14 01:48:57

Try this:

$  echo "four
foo
bar
föö
bår
fo
ba
fö
bå
"|awk ' {x=$0;gsub(/./,"x",x); if( length(x) >= 3 ) print $0 } '

Output:

four
foo
bar
föö
bår
