Need to remove subdomains from domains

Posted 2025-02-10 00:56:03

I am trying to get the last 2 fields, counting from right to left, with the cut command.

I have a large database of about 110 million domains and subdomains, for example:

yahoo.com
mail.yahoo.com
a.yahoo.com
a.yahoo.co.uk

In simple terms, I am trying to strip the subdomains from the domain names:

echo a.yahoo.aa | cut -d '.' -f 2,3
yahoo.aa

But when I try

echo yahoo.aa | cut -d '.' -f 2,3
aa

it gives me only aa.

The required output is:

yahoo.com
yahoo.com
yahoo.com
yahoo.co.uk

Edit: thanks to anubhava for the suggestion.

The TLD pattern looks like

xxxx.xx
xxx.xx
xx.xx

i.e. a ccTLD always has 2 characters in its last label.

Comments (5)

尹雨沫 2025-02-17 00:56:03

A long solution, but I think it does what you want.

Executable file domain.awk:

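#! /usr/bin/awk -f

BEGIN {
    FS="."    # split each line into dot-separated labels
}
{
    ret = $NF    # always keep the last label (the TLD)
    # A 2- or 3-character second-to-last label (co, com, ...) is treated as part of the suffix
    if (NF >= 2 && (length($(NF - 1)) == 2 || length($(NF - 1)) == 3)) {
        ret = $(NF - 1) "." ret
        if (NF >= 3) {
            ret = $(NF - 2) "." ret    # also keep the label in front of the suffix
        }
    } else if (NF >= 2) {
        ret = $(NF - 1) "." ret    # ordinary case: keep the last two labels
    }
    print ret
}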

With this domains.lst file:

yahoo.com
mail.yahoo.com
a.yahoo.com
a.yahoo.co.uk
aus.co.au

Used like this:

./domain.awk domains.lst

Output:

yahoo.com
yahoo.com
yahoo.com
yahoo.co.uk
aus.co.au
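
The length test in the script above is a heuristic, so it can help to probe it with a few extra names. The sketch below inlines the same logic in a shell function; the function name `registrable` and the extra input `mail.bbc.com` are assumptions made for illustration and do not come from the original post.

```shell
#!/bin/sh
# Hypothetical wrapper around the same logic as domain.awk, reading names from stdin.
registrable() {
    awk -F. '{
        ret = $NF
        # Same heuristic: a 2- or 3-character second-to-last label counts as suffix.
        if (NF >= 2 && (length($(NF - 1)) == 2 || length($(NF - 1)) == 3)) {
            ret = $(NF - 1) "." ret
            if (NF >= 3) ret = $(NF - 2) "." ret
        } else if (NF >= 2) {
            ret = $(NF - 1) "." ret
        }
        print ret
    }'
}

# One name from the question and one short second-level name.
printf '%s\n' a.yahoo.co.uk mail.bbc.com | registrable
```

This prints yahoo.co.uk and then mail.bbc.com: because bbc is three characters long, it is treated like co or com and the subdomain is kept. Handling such names would need an explicit list of suffixes rather than a length check.
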
不甘平庸 2025-02-17 00:56:03

Using the sample input you provided, and accepting your statement that "a ccTLD always has 2 characters in last" as your criterion for printing the last 3 instead of the last 2 segments of the input:

Using GNU grep for -o:

$ grep -Eo '[^.]+\.[^.]+(\.[^.]{2})?$' file
yahoo.com
yahoo.com
yahoo.com
yahoo.co.uk

or using any awk:

$ awk 'match($0,/[^.]+\.[^.]+(\.[^.]{2})?$/){print substr($0,RSTART)}' file
yahoo.com
yahoo.com
yahoo.com
yahoo.co.uk
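
With this criterion, a name such as mail.yahoo.de keeps all three labels, because any 2-character last label triggers the 3-segment case. A possible refinement, sketched below, keeps three labels only when the second-to-last label is also a known second-level suffix. The function name `strip_sub` and the short suffix list (co, com, org, net, ac, gov, edu) are assumptions for illustration, not something the question specifies, and a real data set would likely need a fuller list.

```shell
#!/bin/sh
# Hypothetical refinement: 3 labels only for <name>.<known second level>.<2-letter ccTLD>.
strip_sub() {
    awk -F. '{
        keep = 2
        # Assumed, incomplete list of second-level suffixes used under ccTLDs.
        if (length($NF) == 2 && $(NF - 1) ~ /^(co|com|org|net|ac|gov|edu)$/) keep = 3
        if (keep > NF) keep = NF          # never ask for more labels than exist
        out = $NF
        for (i = NF - 1; i > NF - keep; i--) out = $i "." out
        print out
    }'
}

printf '%s\n' mail.yahoo.com a.yahoo.co.uk mail.yahoo.de | strip_sub
```

For these three inputs the sketch prints yahoo.com, yahoo.co.uk and yahoo.de.
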

開玄 2025-02-17 00:56:03

Try

echo a.yahoo.aa | awk -F'.' '{print $(NF-1)"."$NF}'
不知所踪 2025-02-17 00:56:03

large database for about 110 Million domains and subdomains.

Because of this, I suggest using sed here. Let the content of file.txt be

yahoo.com
mail.yahoo.com
a.yahoo.com

then

sed 's/^.*\.\([^.]*\.[^.]*\)$/\1/' file.txt

output

yahoo.com
yahoo.com
yahoo.com

Explanation: in a regular expression spanning the whole line (^ is the start, $ is the end), I use a single capturing group that contains zero or more (*) non-dot characters, followed by a literal dot (\.), followed by zero or more non-dot characters adjacent to the end of the line. I then replace the whole line with the content of that group. Lines with fewer than two dots do not match the expression and are printed unchanged. Disclaimer: this solution assumes there is always at least one dot in each line.

(tested in GNU sed 4.2.2)
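
The same substitution can be written with extended regular expressions, which removes the backslashes around the group. The sketch below assumes a sed that supports the `-E` option (GNU and BSD sed both do); the function name `last_two` and the extra inputs are illustrative only.

```shell
#!/bin/sh
# Hypothetical helper: same substitution as above, in extended-regex syntax.
last_two() {
    sed -E 's/^.*\.([^.]*\.[^.]*)$/\1/'
}

printf '%s\n' mail.yahoo.com yahoo.com a.yahoo.co.uk | last_two
```

This prints yahoo.com, yahoo.com and co.uk. The second line has only one dot, so the expression does not match and the line passes through unchanged. The third line shows that this approach always keeps exactly two labels, so names under suffixes like co.uk need one of the other approaches.
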

梦晓ヶ微光ヅ倾城 2025-02-17 00:56:03

You are selecting only fields 2 and 3. You need to select from field 2 up to the end:

 ... | cut -d '.' -f 2-
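
Since cut only counts fields from the left, one common workaround for "the last 2 fields" is to reverse each line, cut the first two fields, and reverse back. The sketch below assumes the `rev` utility is available (it ships with util-linux and most BSD systems); the function name `last_two_fields` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: take the last two dot-separated fields by reversing each line.
last_two_fields() {
    rev | cut -d '.' -f 1,2 | rev
}

echo a.yahoo.aa | last_two_fields
echo yahoo.aa | last_two_fields
```

Both commands print yahoo.aa, which covers the two cases from the question. This still keeps exactly two fields, so it does not handle a.yahoo.co.uk on its own.
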