How do I remove these duplicates using AWK?

Posted 2024-12-05 07:56:11

I am new to AWK and have only a basic idea of how it works. I want to remove duplicates in a file, for example:

    0008.ASIA. NS AS2.DNS.ASIA.CN.
    0008.ASIA. NS AS2.DNS.ASIA.CN.
    ns1.0008.asia. NS AS2.DNS.ASIA.CN.
    www.0008.asia. NS AS2.DNS.ASIA.CN.
    anish.asia NS AS2.DNS.ASIA.CN.
    ns2.anish.asia NS AS2.DNS.ASIA.CN
    ANISH.asia. NS AS2.DNS.ASIA.CN.

This is a sample file; running this command on it, I got output like this:

awk 'BEGIN{IGNORECASE=1}/^[^ ]+asia/ { gsub(/\.$/,"",$1);split($1,a,".")} length(a)==2{b[$1]++;}END{for (x in b)print x}'

0008.ASIA.
anish.asia.
ANISH.asia

But I want output like this

  008.ASIA
  anish.asia

or

008.ASIA
ANISH.asia

How do I remove these kinds of duplicates?

Thanks in Advance
Anish kumar.V

Thanks for your immediate response. Actually, I wrote the complete script in bash, and I am now in the final stage. How do I invoke Python in that? :-(

#!/bin/bash

current_date=`date +%d-%m-%Y_%H.%M.%S`
today=`date +%d%m%Y`
yesterday=`date -d 'yesterday' '+%d%m%Y'`
RootPath=/var/domaincount/asia/
MainPath=$RootPath${today}asia
LOG=/var/tmp/log/asia/asiacount$current_date.log

mkdir -p $MainPath
echo Intelliscan Process started for Asia TLD $current_date 

exec 6>&1 >> $LOG

#################################################################################################
## Download the zone file with wget; it will try only once
if ! wget --tries=1 --ftp-user=USERNAME --ftp-password=PASSWORD ftp://ftp.anish.com:21/zonefile/anish.zone.gz
then
    echo Download Not Success Domain count Failed With Error
    exit 1
fi
### The downloaded file is gzipped; unzip it and start the domain count process ###
gunzip -c asia.zone.gz > $MainPath/$today.asia

###### It will start the Count #####
awk '/^[^ ]+ASIA/ && !_[$1]++{print $1; tot++}END{print "Total",tot,"Domains"}' $MainPath/$today.asia > $RootPath/zonefile/$today.asia
awk '/Total/ {print $2}' $RootPath/zonefile/$today.asia > $RootPath/$today.count

a=$(< $RootPath/$today.count)
b=$(< $RootPath/$yesterday.count)
c=$(awk 'NR==FNR{a[$0];next} $0 in a{tot++}END{print tot}' $RootPath/zonefile/$today.asia $RootPath/zonefile/$yesterday.asia)

echo "$current_date Count For Asia TlD $a"
echo "$current_date Overall Count For Asia TlD $c"
echo "$current_date New Registration Domain Counts $((c - a))"
echo "$current_date Deleted Domain Counts $((c - b))"

exec >&6 6>&-
cat $LOG | mail -s "Asia Tld Count log" [email protected]

In that

 awk '/^[^ ]+ASIA/ && !_[$1]++{print $1; tot++}END{print "Total",tot,"Domains"}' $MainPath/$today.asia > $RootPath/zonefile/$today.asia

it is only this part where I am still trying to work out how to get the distinct values, so any suggestion using AWK would suit me best. Thanks again for your immediate response.

Comments (5)

沩ん囻菔务 2024-12-12 07:56:11
kent$  cat a
0008.ASIA. NS AS2.DNS.ASIA.CN.
0008.ASIA. NS AS2.DNS.ASIA.CN.
ns1.0008.asia. NS AS2.DNS.ASIA.CN.
www.0008.asia. NS AS2.DNS.ASIA.CN.
anish.asia NS AS2.DNS.ASIA.CN.
ns2.anish.asia NS AS2.DNS.ASIA.CN
ANISH.asia. NS AS2.DNS.ASIA.CN.


kent$  awk -F' NS' '{ gsub(/\.$/,"",$1);split($1,a,".")} length(a)==2{b[tolower($1)]++;}END{for (x in b)print x}' a
anish.asia
0008.asia

By the way, it is interesting that I gave you a solution at http://www.unix.com/shell-programming-scripting/167512-using-awk-how-its-possible.html, then you added something new to your file, and so I added the tolower() function here. :D
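
If you want to fold this into the counting step of your script, a rough sketch could look like the following; reusing the path variables from your own script (and the exact second-level-only condition) are my assumptions here:

    awk -F' NS' '{ sub(/\.$/, "", $1); n = split($1, a, ".") }
         n == 2 && !seen[tolower($1)]++ { print $1; tot++ }   # first time this name is seen, case-insensitively
         END { print "Total", tot, "Domains" }' "$MainPath/$today.asia" > "$RootPath/zonefile/$today.asia"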

一场信仰旅途 2024-12-12 07:56:11

By putting your AWK script into a separate file, you can tell what's really going on. Here's a simple approach to your "filter out the duplicates" problem:

# For each line in the file
{

  # Decide on a unique key (eg. case insensitive without trailing period)
  unique_key = tolower($1)
  sub(/\.$/, "", unique_key)

  # If this line isn't a duplicate (it hasn't been found yet)
  if (!(unique_key in already_found)) {

    # Mark this unique key as found
    already_found[unique_key] = "found"

    # Print out the relevant data
    print($1)
  }
}

You can run AWK files by passing the -f option to awk.
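
For example, if the script above were saved to a file, it could be run like this (dedupe.awk and zonefile.txt are just placeholder names, not files from the question):

    awk -f dedupe.awk zonefile.txt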

If the above script isn't recognizable as an AWK script, here it is in inline form:

awk '{ key = tolower($1); sub(/\.$/, "", key); if (!(key in found)) { found[key] = 1; print($1) } }'
攒眉千度 2024-12-12 07:56:11

Or, just use the shell:

echo '    0008.ASIA. NS AS2.DNS.ASIA.CN.
    0008.ASIA. NS AS2.DNS.ASIA.CN.
    ns1.0008.asia. NS AS2.DNS.ASIA.CN.
    www.0008.asia. NS AS2.DNS.ASIA.CN.
    anish.asia NS AS2.DNS.ASIA.CN.
    ns2.anish.asia NS AS2.DNS.ASIA.CN
    ANISH.asia. NS AS2.DNS.ASIA.CN.' |
while read domain rest; do
    domain=${domain%.}
    case "$domain" in
        (*.*.*) : ;;
        (*.[aA][sS][iI][aA]) echo "$domain" ;;
    esac
done |
sort -fu

produces

0008.ASIA
anish.asia
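
The same loop can read the zone file from the question's script instead of the inline echo; a sketch, where borrowing the $MainPath/$today.asia path from that script is my assumption:

    while read -r domain rest; do
        domain=${domain%.}
        case "$domain" in
            (*.*.*) : ;;                           # skip third-level names such as ns1.0008.asia
            (*.[aA][sS][iI][aA]) echo "$domain" ;; # keep second-level .asia names
        esac
    done < "$MainPath/$today.asia" | sort -fu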
浊酒尽余欢 2024-12-12 07:56:11

Don't use AWK. Use Python

import sys

result = set()
for line in sys.stdin:              # read the zone file from standard input
    words = line.split()
    if words and "asia" in words[0].lower():
        result.add(words[0].lower())
for name in result:
    print(name)

That might be easier to work with than AWK. Yes. It's longer. But it may be easier to understand.
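
Since the question also asks how to invoke Python from the bash script, one way would be to save the snippet above as a file and pipe the zone file into it; the name dedupe.py and the reuse of the script's path variables are my assumptions:

    # hypothetical replacement for the awk counting step in the bash script
    python dedupe.py < "$MainPath/$today.asia" > "$RootPath/zonefile/$today.asia"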

葬シ愛 2024-12-12 07:56:11

Here is an alternative solution. Let sort create your case-folded and unique list (and it will be sorted!)

  {
   cat - <<EOS
   0008.ASIA. NS AS2.DNS.ASIA.CN.
   0008.ASIA. NS AS2.DNS.ASIA.CN.
   ns1.0008.asia. NS AS2.DNS.ASIA.CN.
   www.0008.asia. NS AS2.DNS.ASIA.CN.
   anish.asia NS AS2.DNS.ASIA.CN.
   ns2.anish.asia NS AS2.DNS.ASIA.CN
   ANISH.asia. NS AS2.DNS.ASIA.CN.

EOS
 } |   awk '{
      #dbg print "$0=" $0
      targ=$1
      sub(/\.$/, "", targ)
      n=split(targ,tmpArr,".")
      #dbg print "n="n
      if (n > 2) targ=tmpArr[n-1] "." tmpArr[n]
      print targ 
     }' \
 | sort -f -u

output

0008.ASIA
anish.asia

Edit: fixed sort -i -u to sort -f -u. Many other Unix utilities use '-i' to indicate 'ignore case'. My test showed me I needed to fix it, and I forgot to fix it in the final posting.
