Regular expression question in gawk
My CSV data file looks like this:
title,name,gender
MRS.,MADHU,Female
MRS.,RAJ KUMAR,male
MR.,N,Male
MRS.,SHASHI,Female
MRS.,ALKA,Female
As you can see, I want to reject rows like the second and third data rows (i.e. the name must contain no whitespace and must be at least 3 characters long):
MRS.,RAJ KUMAR,male
MR.,N,Male
and place them in a file called rejected_list.csv; all remaining rows go into a file called clean_list.csv.
Here is my gawk script for it:
gawk -F ',' '{
if( $2 ~ /\S/ &&
$1 ~ /MRS.|MR.|MS.|MISS.|MASTER.|SMT.|DR.|BABY.|PROF./ &&
$3 ~ /M|F|Male|Female/)
print $1","$2","$3 > "clean_list.csv";
else
print $1","$2","$3 > "rejected_list.csv" } ' \
< DATA_file.csv
My problem is that this script does not recognise the '\S' character class (any character except whitespace). Instead, it selects every name that starts with or contains an S and rejects the rest.
A simple regex like /([A-Z])/ in place of /\S/ works perfectly, but as soon as I add the interval {3,} the script fails:
gawk -F ',' '{
if( $2 ~ /([A-Z]){3,}/ &&
$1 ~ /MRS.|MR.|MS.|MISS.|MASTER.|SMT.|DR.|BABY.|PROF./ &&
$3 ~ /M|F|Male|Female/)
print $1","$2","$3 > "clean_list.csv";
else
print $1","$2","$3 > "rejected_list.csv" } ' \
< DATA_file.csv
I have tried all sorts of combinations of the regex with '*', '+' etc., but I can't get what I want.
Can anyone tell me what the problem is?
Comments (2)
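Use [:graph:] instead of \S to match any printable, visible character; it has to appear inside a bracket expression, i.e. [[:graph:]]. Your gawk does not recognize \S as equivalent to [[:graph:]] and, as you observed, treats it as a literal S, so it will not work.
Additionally, the {3,} interval expression only works in POSIX mode (--posix) or re-interval mode (--re-interval).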
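A minimal sketch of that advice, applied to the sample data from the question. It assumes a gawk build that accepts --re-interval, and it only illustrates the name test: the title and gender checks from the original script are left out, and skipping the header row is a choice made here rather than something the question asks for.

```shell
# Recreate the sample data from the question (file name taken from the question).
cat > DATA_file.csv <<'EOF'
title,name,gender
MRS.,MADHU,Female
MRS.,RAJ KUMAR,male
MR.,N,Male
MRS.,SHASHI,Female
MRS.,ALKA,Female
EOF

# --re-interval enables {3,}; [[:graph:]] matches visible, non-space characters.
gawk --re-interval -F ',' '
NR == 1 { next }                     # assumption: drop the header row
$2 ~ /^[[:graph:]]{3,}$/ {           # name: 3+ visible characters, no whitespace
    print > "clean_list.csv"
    next
}
{ print > "rejected_list.csv" }      # everything else is rejected
' DATA_file.csv
```

Anchoring the pattern with ^ and $ is what makes a name containing a space fail, because [[:graph:]] does not match whitespace; without the anchors, "RAJ KUMAR" would still match on "RAJ".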
I added a rejection condition: any row that does not have exactly 3 fields.