How to extract only English words and leave out the Devanagari words in a bash script?

Posted on 2025-01-23 07:41:13

The text file looks like this:

#एक
1के
अंकगणित8IU
अधोरेखाunderscore
$thatऔर
%redएकyellow
$चिह्न
अंडरस्कोर@_

The desired text file should look like this:

#
1
8IU
underscore
$that
%redyellow
$
@_

This is what I have tried so far, using awk:

awk -F"[अ-ह]*" '{print $1}' filename.txt

The output I get is:

#
1


$that
%red
$

Using awk -F"[अ-ह]*" '{print $1,$2}' filename.txt instead, I get this output:

# 
1 े
 ं
 ो
$that 
%red yellow
$ ि
 ं

Is there any way to solve this in a bash script?

Comments (3)

梦醒时光 2025-01-30 07:41:13

Using perl:

$ perl -CSD -lpe 's/\p{Devanagari}+//g' input.txt
#
1
8IU
underscore
$that
%redyellow
$
@_

-CSD tells perl that the standard streams and any opened files are encoded in UTF-8. -l removes the newline from each input line and restores it on output. -p loops over the input files, printing each line to standard output after running the script given by -e. If you want to modify the file in place, add the -i option.

The regular expression matches any run of code points assigned to the Devanagari script in the Unicode standard and removes it. Use \P{Devanagari} to do the opposite and remove the non-Devanagari characters.

稍尽春風 2025-01-30 07:41:13

Using awk you can do:

awk '{gsub(/[^\x00-\x7F]+/, "")} 1' file
#
1
8IU
underscore
$that
%redyellow

The bracket expression [\x00-\x7F] matches all values from 0 to 127, which is the defined range of the ASCII character set. The complemented list [^\x00-\x7F] matches anything outside that range, which includes every Devanagari character (each one is encoded in UTF-8 as several non-ASCII bytes).
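
A variant of the same idea, sketched under two assumptions that are not in the answer: LC_ALL=C is set so that awk compares bytes rather than multibyte characters, and the range is written with octal escapes (\001-\177), which POSIX awk defines, in case a given awk does not accept \x escapes. gsub is used so that every non-ASCII run on a line is removed, not just the first. The function name ascii_only is hypothetical.

```shell
#!/usr/bin/env bash
# ascii_only is a hypothetical name. LC_ALL=C makes awk work on bytes,
# so each byte of a UTF-8 Devanagari character (all >= 0x80) is matched.
# Note: the range starts at \001, so a NUL byte would also be removed.
ascii_only() {
    LC_ALL=C awk '{ gsub(/[^\001-\177]+/, "") } 1'
}

printf '%s\n' 'एकredएकyellow' '$चिह्न' 'अंडरस्कोर@_' | ascii_only
```

The first sample line has Devanagari text in two places, which is the case where a single sub call would leave the second run behind.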

兮颜 2025-01-30 07:41:13

tr is a very good fit for this task:

LC_ALL=C tr -c -d '[:cntrl:][:graph:]' < input.txt

Setting LC_ALL=C selects the POSIX C locale, in which only ASCII characters belong to the character classes.

The -d option tells tr to delete, and -c takes the complement of the set [:cntrl:][:graph:], so tr deletes every byte that is neither a control character nor a visible (graphic) character. Because all locale settings are forced to C, every non-ASCII byte falls into that complement and is discarded.
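
One detail worth checking before using this on text with spaces: the space character is in neither [:cntrl:] nor [:graph:], so the command above deletes spaces as well. A minimal sketch of a variant that keeps them, using [:print:] (graphic characters plus the space) in place of [:graph:]; the function name strip_non_ascii is hypothetical.

```shell
#!/usr/bin/env bash
# strip_non_ascii is a hypothetical helper name, not from the answer.
# Deletes every byte that is neither a control character nor a printable
# ASCII character; [:print:], unlike [:graph:], includes the space.
strip_non_ascii() {
    LC_ALL=C tr -c -d '[:cntrl:][:print:]'
}

printf '%s\n' 'red एक yellow' '$thatऔर' | strip_non_ascii
```

Newlines and tabs are control characters, so line structure is preserved in both variants.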
