Need to remove subdomains from domains

Posted 2025-02-10 00:56:03


I am trying to get the last 2 fields, counting from the right, with the cut command.

I have a large database of about 110 million domains and subdomains, such as:

yahoo.com
mail.yahoo.com
a.yahoo.com
a.yahoo.co.uk

In simple words, I am trying to remove subdomains from domains:

echo a.yahoo.aa | cut -d '.' -f 2,3
yahoo.aa

but when I try

echo yahoo.aa | cut -d '.' -f 2,3
aa

it gives me only aa.

The required output is

yahoo.com
yahoo.com
yahoo.com
yahoo.co.uk

Edit: thanks to anubhava for the suggestion.

A TLD pattern looks like

xxxx.xx
xxx.xx
xx.xx

i.e. a ccTLD always has 2 characters at the end.


尹雨沫 2025-02-17 00:56:03

A long solution, but I think it does what you want.

Executable file domain.awk:

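#! /usr/bin/awk -f

BEGIN {
    FS = "."    # split each line into dot-separated labels
}
{
    ret = $NF   # always keep the last label
    if (NF >= 2 && (length($(NF - 1)) == 2 || length($(NF - 1)) == 3)) {
        # a 2- or 3-character second-to-last label (co, com, ...) is
        # treated as part of the suffix, so keep one more label
        ret = $(NF - 1) "." ret
        if (NF >= 3) {
            ret = $(NF - 2) "." ret
        }
    } else if (NF >= 2) {
        ret = $(NF - 1) "." ret
    }
    print ret
}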

With this domains.lst file:

yahoo.com
mail.yahoo.com
a.yahoo.com
a.yahoo.co.uk
aus.co.au

Used like this:

./domain.awk domains.lst

Output:

yahoo.com
yahoo.com
yahoo.com
yahoo.co.uk
aus.co.au
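
One caveat worth checking: the length test cannot tell a short registered name from a second-level suffix. The sketch below wraps the same heuristic in a shell function (registrable is a hypothetical name, not part of the script above) and feeds it www.abc.com, where abc is three characters long, so the subdomain is kept.

```shell
# registrable: same length-based heuristic as domain.awk, read from stdin.
# The function name is hypothetical and only used for this illustration.
registrable() {
    awk -F'.' '{
        ret = $NF
        if (NF >= 2) ret = $(NF - 1) "." ret
        # A 2- or 3-character second-to-last label is assumed to be part
        # of the suffix, so one more label is kept when it exists.
        if (NF >= 3 && (length($(NF - 1)) == 2 || length($(NF - 1)) == 3))
            ret = $(NF - 2) "." ret
        print ret
    }'
}

printf '%s\n' a.yahoo.co.uk www.abc.com | registrable
# prints:
# yahoo.co.uk
# www.abc.com
```

If names like this occur in the data, a list of known second-level suffixes would be more reliable than a length check.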


不甘平庸 2025-02-17 00:56:03

Using the sample input you provided, and accepting your statement that "a ccTLD always has 2 characters in last" as your criterion for printing the last 3 instead of the last 2 segments of the input:

Using GNU grep for -o:

$ grep -Eo '[^.]+\.[^.]+(\.[^.]{2})?$' file
yahoo.com
yahoo.com
yahoo.com
yahoo.co.uk

or using any awk:

$ awk 'match($0,/[^.]+\.[^.]+(\.[^.]{2})?$/){print substr($0,RSTART)}' file
yahoo.com
yahoo.com
yahoo.com
yahoo.co.uk
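
Note that this pattern keeps three labels whenever the last label has two characters, whether or not the label before it is a second-level suffix such as co. A minimal sketch of that behaviour, assuming a grep that supports -o (GNU and BSD grep do); last_labels is a hypothetical name used only here:

```shell
# last_labels: apply the two-letter rule to stdin or to the files given.
# Hypothetical helper; it only wraps the grep pattern for easy testing.
last_labels() {
    grep -Eo '[^.]+\.[^.]+(\.[^.]{2})?$' "$@"
}

printf '%s\n' a.yahoo.co.uk a.yahoo.aa | last_labels
# prints:
# yahoo.co.uk
# a.yahoo.aa
```

So a name like a.yahoo.aa from the question is returned unchanged, which may or may not be what is wanted for plain two-letter TLDs.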



開玄 2025-02-17 00:56:03

Try

echo a.yahoo.aa | awk -F'.' '{print $(NF-1)"."$NF}'


不知所踪 2025-02-17 00:56:03

"large database for about 110 Million domains and subdomains."

Because of this I suggest using sed here. Let the content of file.txt be

yahoo.com
mail.yahoo.com
a.yahoo.com

then

sed 's/^.*\.\([^.]*\.[^.]*\)$/\1/' file.txt

output

yahoo.com
yahoo.com
yahoo.com

Explanation: the regular expression spans the whole line (^ is the start, $ is the end). I use a single capturing group that contains zero or more (*) non-dots, followed by a literal dot (\.), followed by zero or more non-dots, adjacent to the end of the line, and I replace the whole line with the content of that group. A line with fewer than two dots, such as yahoo.com, does not match and is printed unchanged. Disclaimer: this solution assumes there is always at least one dot in each line.

(tested in GNU sed 4.2.2)
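
The command above always keeps exactly two labels, so a.yahoo.co.uk would come out as co.uk. A possible extension, sketched here under the question's assumption that a two-character last label means a three-label result, tries the three-label form first and falls back to two labels. It assumes a sed that accepts -E (GNU and BSD sed do), strip_subdomains is a hypothetical name, and it has not been timed at the 110-million-line scale.

```shell
# strip_subdomains: reads stdin or the files given as arguments.
# Pass 1 keeps three labels when the last label has exactly 2 characters.
# "t" skips pass 2 if pass 1 substituted.
# Pass 2 keeps the last two labels otherwise.
strip_subdomains() {
    sed -E \
        -e 's/^(.*\.)?([^.]+\.[^.]+\.[^.]{2})$/\2/' \
        -e t \
        -e 's/^(.*\.)?([^.]+\.[^.]+)$/\2/' "$@"
}

printf '%s\n' yahoo.com mail.yahoo.com a.yahoo.com a.yahoo.co.uk | strip_subdomains
# prints:
# yahoo.com
# yahoo.com
# yahoo.com
# yahoo.co.uk
```

The optional (.*\.)? prefix lets lines that are already bare, such as yahoo.co.uk, match without a leading subdomain.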
