How can I extract only the English words and leave out the Devanagari words in a bash script?
The text file is like this,
#एक
1के
अंकगणित8IU
अधोरेखाunderscore
$thatऔर
%redएकyellow
$चिह्न
अंडरस्कोर@_
The desired text file should be like,
#
1
8IU
underscore
$that
%redyellow
$
@_
This is what I have tried so far, using awk
awk -F"[अ-ह]*" '{print $1}' filename.txt
And the output that I am getting is,
#
1
$that
%red
$
And using this,
awk -F"[अ-ह]*" '{print $1,$2}' filename.txt
I am getting an output like this,
#
1 े
ं
ो
$that
%red yellow
$ ि
ं
Is there any way to solve this in a bash script?
Using perl:
-CSD tells perl that standard streams and any opened files are encoded in UTF-8. -p loops over the input files, printing each line to standard output after executing the script given by -e. If you want to modify the file in place, add the -i option.
The regular expression matches any codepoints assigned to the Devanagari script in the Unicode standard and removes them. Use \P{Devanagari} to do the opposite and remove the non-Devanagari characters.
Using awk you can do:
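The awk command for this answer is not shown here, so the following is only a substitute sketch under stated assumptions, not the answer author's command. It assumes that dropping every byte outside printable ASCII is acceptable, and it uses placeholder file names.

```shell
# Recreate the sample input from the question (file name is a placeholder).
printf '%s\n' '#एक' '1के' 'अंकगणित8IU' 'अधोरेखाunderscore' \
  '$thatऔर' '%redएकyellow' '$चिह्न' 'अंडरस्कोर@_' > filename.txt

# LC_ALL=C makes awk work byte by byte.
# [^ -~] matches any byte outside the printable ASCII range (space to tilde),
# which includes every byte of a UTF-8 encoded Devanagari character.
# Note: tabs fall outside that range and would be removed as well.
LC_ALL=C awk '{ gsub(/[^ -~]/, ""); print }' filename.txt > english_only_awk.txt

cat english_only_awk.txt
```

Unlike the -F approach in the question, gsub edits the whole line in place, so text after the Devanagari part (such as 8IU or yellow) is kept rather than being split into separate fields.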
tr is a very good fit for this task:
The command sets the POSIX C locale environment so that only the US English (ASCII) character set is valid.
It then instructs tr to delete (-d) the complement (-c) of [:cntrl:][:graph:], that is, every character that is neither a control character nor a visible (graphic) character. Since all the locale settings are set to C, all non-US-English characters are discarded.
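A minimal sketch of a tr invocation matching that description. The exact command line is an assumption reconstructed from the explanation (the -d and -c options and the two character classes come from the answer text), and the file names are placeholders.

```shell
# Recreate the sample input from the question (file name is a placeholder).
printf '%s\n' '#एक' '1के' 'अंकगणित8IU' 'अधोरेखाunderscore' \
  '$thatऔर' '%redएकyellow' '$चिह्न' 'अंडरस्कोर@_' > filename.txt

# LC_ALL=C : only ASCII bytes belong to the character classes
# -d -c    : delete everything NOT in the listed classes
# [:cntrl:] keeps newlines and tabs, [:graph:] keeps visible ASCII characters
LC_ALL=C tr -dc '[:cntrl:][:graph:]' < filename.txt > english_only_tr.txt

cat english_only_tr.txt
```

One caveat with these two classes: a plain space is neither a control nor a graphic character, so spaces are deleted too. That does not matter for the sample input, but for text that contains spaces, [:print:] in place of [:graph:] would keep them.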