How do I remove these duplicates using AWK?
I am new to AWK and have only a basic understanding of it. I want to remove duplicates in a file, for example:
0008.ASIA. NS AS2.DNS.ASIA.CN.
0008.ASIA. NS AS2.DNS.ASIA.CN.
ns1.0008.asia. NS AS2.DNS.ASIA.CN.
www.0008.asia. NS AS2.DNS.ASIA.CN.
anish.asia NS AS2.DNS.ASIA.CN.
ns2.anish.asia NS AS2.DNS.ASIA.CN
ANISH.asia. NS AS2.DNS.ASIA.CN.
This is a sample file. Running this command on it, I got output like this:
awk 'BEGIN{IGNORECASE=1}/^[^ ]+asia/ { gsub(/\.$/,"",$1);split($1,a,".")} length(a)==2{b[$1]++;}END{for (x in b)print x}'
0008.ASIA.
anish.asia.
ANISH.asia
But I want output like this:
0008.ASIA
anish.asia
or
0008.ASIA
ANISH.asia
How do I remove these kinds of duplicates?
Thanks in Advance
Anish kumar.V
Thanks for your immediate response. Actually, I wrote the complete script in bash and am now in the final stage. How do I invoke Python in that? :-(
#!/bin/bash
current_date=`date +%d-%m-%Y_%H.%M.%S`
today=`date +%d%m%Y`
yesterday=`date -d 'yesterday' '+%d%m%Y'`
RootPath=/var/domaincount/asia/
MainPath=$RootPath${today}asia
LOG=/var/tmp/log/asia/asiacount$current_date.log
mkdir -p $MainPath
echo Intelliscan Process started for Asia TLD $current_date
exec 6>&1 >> $LOG
#################################################################################################
## Download the zone file using wget; it will try only once
if ! wget --tries=1 --ftp-user=USERNAME --ftp-password=PASSWORD ftp://ftp.anish.com:21/zonefile/anish.zone.gz
then
echo Download Not Success Domain count Failed With Error
exit 1
fi
### The downloaded file is gzipped; unzip it and start the domain count process ###
gunzip -c asia.zone.gz > $MainPath/$today.asia
###### It will start the Count #####
awk '/^[^ ]+ASIA/ && !_[$1]++{print $1; tot++}END{print "Total",tot,"Domains"}' $MainPath/$today.asia > $RootPath/zonefile/$today.asia
awk '/Total/ {print $2}' $RootPath/zonefile/$today.asia > $RootPath/$today.count
a=$(< $RootPath/$today.count)
b=$(< $RootPath/$yesterday.count)
c=$(awk 'NR==FNR{a[$0];next} $0 in a{tot++}END{print tot}' $RootPath/zonefile/$today.asia $RootPath/zonefile/$yesterday.asia)
echo "$current_date Count For Asia TlD $a"
echo "$current_date Overall Count For Asia TlD $c"
echo "$current_date New Registration Domain Counts $((c - a))"
echo "$current_date Deleted Domain Counts $((c - b))"
exec >&6 6>&-
cat $LOG | mail -s "Asia Tld Count log" [email protected]
In this part:
awk '/^[^ ]+ASIA/ && !_[$1]++{print $1; tot++}END{print "Total",tot,"Domains"}' $MainPath/$today.asia > $RootPath/zonefile/$today.asia
I am now trying to work out how to get the distinct values, so any suggestions using AWK would be very helpful. Thanks again for your immediate response.
Comments (5)
btw, it is interesting that I gave you a solution at http://www.unix.com/shell-programming-scripting/167512-using-awk-how-its-possible.html, you then added something new to your file, and I added the tolower() function here. :D

By putting your AWK script into a separate file, you can tell what's really going on. Here's a simple approach to your "filter out the duplicates" problem: you can run AWK files by passing the -f option to awk, and if a script in its own file isn't immediately recognizable as an AWK script, it can also be written in inline form. Both are sketched below.
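A minimal sketch of what such a script file could look like (the file name dedupe.awk, the input file name zonefile, and the exact filtering rule of keeping only two-label names while ignoring case and the trailing dot are assumptions based on the question, not the answer's original code):

# dedupe.awk: print each second-level domain once, ignoring case and the trailing dot
{
    domain = tolower($1)                 # case-fold the owner name in the first field
    sub(/\.$/, "", domain)               # strip the trailing root dot, if present
    if (split(domain, parts, ".") == 2 && !seen[domain]++)
        print domain                     # print each two-label name only the first time it is seen
}

Run it with:

awk -f dedupe.awk zonefile

And the same logic in inline form:

awk '{ d = tolower($1); sub(/\.$/, "", d); if (split(d, p, ".") == 2 && !seen[d]++) print d }' zonefile

On the sample data above, both print 0008.asia and anish.asia.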
Or, just use the shell:
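One possible pipeline (a sketch from standard tools; the input file name zonefile and the exact normalisation steps are assumptions, not necessarily the answer's original command):

cut -d' ' -f1 zonefile | tr '[:upper:]' '[:lower:]' | sed 's/\.$//' | grep -E '^[^.]+\.[^.]+$' | sort -u

which, on the sample data above, produces

0008.asia
anish.asia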
Don't use AWK. Use Python. Python might be easier to work with than AWK. Yes, it's longer. But it may be easier to understand.
Here is an alternative solution: let sort create your case-folded, uniq'd list (and it will be sorted!).

Edit: fixed sort -i -u to sort -f -u. Many other Unix utilities use '-i' to indicate 'ignore case'. My test showed me I needed to fix it, and I forgot to fix the final posting.
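A sketch of that approach (the awk extraction step and the input file name zonefile are assumptions; the point taken from this answer is letting sort -f -u do the case-insensitive deduplication):

awk '{ sub(/\.$/, "", $1); if (split($1, p, ".") == 2) print $1 }' zonefile | sort -f -u

On the sample data above this keeps one line per name: 0008.ASIA plus one of anish.asia / ANISH.asia (which spelling survives depends on the sort implementation, since -f makes them compare equal and -u then drops the duplicate).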