Parsing out the root domain using a regex and a predetermined list of TLDs

Posted on 2024-12-21 12:05:52

I'd like to use a regex to parse out the root domain of a given input URL. I already know that there is basically no regex that can't be "broken" given the right input URL, which is why I'd like to restrict the regex to a list of given TLDs (if that's possible). Here is an example:

Let's say I've got an input file and will be running each URL in the file through the regex one at a time. Here is the input file:

www.google.co.uk
www.google.co.uk/something
www.google.com/
www.google.com/something
google.com/
google.com/something
subdomain.google.com/
subdomain.google.com/something
www.subdomain.google.com/
www.google.net/
www.google.net/something
google.net/

The final result should be this:

google.co.uk
google.co.uk
google.com
google.com
google.com
google.com
google.com
google.com
google.com
google.com
google.com
google.com

The important thing, though, is that I'd like the regex to parse based on the following:

Find the TLD in the given URL from a list of given TLDs, for instance:

(co.uk|com|net|edu|gov|etc|etc|etc)

If one of the given TLDs is found, then match and parse out everything to the left of (and including) that TLD, up until it reaches either the beginning of the line or another ".".

If it's possible to write a regex that matches based on that "pseudo-code" description, it should parse out the sample input data exactly as shown.

迷离° 2024-12-28 12:05:52

perl -ne 'print $2, "\n" if m-^([^/]+?\.|)([^./]*\.(co\.uk|com|net|edu|gov|etc|etc|etc))(/.*|)$-'  /tmp/x.txt

seems to give the results you are looking for, at least on the sample data you provided (assuming you don't want to translate google.net to google.com ).

Note that I did get a little lazy with my [^./], which could match characters which are not legal in domain names. Then again, i18n has probably rewritten the rules for DNS to include a lot more characters than when I was young.
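For readers not using Perl, here is a rough Python sketch of the same pattern with a stricter label class (letters, digits, hyphens) in place of [^./]. The function name root_domain and the shortened TLD list are illustrative assumptions, not part of the answer above, and internationalized names would have to be in their punycode form for this class to match.

```python
import re

# Same shape as the Perl pattern above, but each label is restricted to
# letters, digits and hyphens instead of "anything except '.' and '/'".
LABEL = r"[A-Za-z0-9-]+"
PATTERN = re.compile(
    rf"^(?:{LABEL}\.)*?({LABEL}\.(?:co\.uk|com|net|edu|gov))(?:/.*)?$"
)

def root_domain(url):
    """Return the root domain of a scheme-less URL, or None if no listed TLD matches."""
    match = PATTERN.match(url.strip())
    return match.group(1) if match else None

for line in ["www.google.co.uk/something", "subdomain.google.com/", "google.net/"]:
    print(root_domain(line))
```

As with the Perl version, google.net stays google.net here rather than being turned into google.com.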

夏见 2024-12-28 12:05:52

In Java:

package test;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {

    public static void main(String[] args) {

        String subject = "www.google.co.uk\nwww.google.co.uk/something\nwww.google.com/\nwww.google.com/something\ngoogle.com/\ngoogle.com/something\nsubdomain.google.com/\nsubdomain.google.com/something\nwww.subdomain.google.com/\nwww.google.net/\nwww.google.net/something\ngoogle.net/\n";
        // The dot in "co.uk" is escaped, and the trailing \b stops a TLD from
        // matching as the prefix of a longer label (e.g. "www.community").
        Pattern pattern = Pattern.compile("(\\w+)\\.(co\\.uk|com|net|edu|gov)\\b");

        Matcher m = pattern.matcher(subject);
        while (m.find()) {
            // Each match is one label followed by a listed TLD.
            System.out.println(m.group());
        }
    }
}

Regex = (\w+)\.(co\.uk|com|net|edu|gov)\b
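Since the question asks for a predetermined TLD list, one option is to build the alternation from a list instead of maintaining it by hand. The following is a minimal Python sketch under that assumption, not a translation of the Java code above; TLDS, build_pattern, and extract are hypothetical names. Each entry is escaped so the dot in co.uk is literal, and longer entries are tried first.

```python
import re

TLDS = ["com", "net", "edu", "gov", "co.uk"]  # illustrative list, extend as needed

def build_pattern(tlds):
    # Escape each entry (so "." is literal) and try longer suffixes first.
    ordered = sorted(tlds, key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in ordered)
    # One label, a dot, a listed TLD, anchored at the end of the host.
    return re.compile(rf"([A-Za-z0-9-]+\.(?:{alternation}))$")

PATTERN = build_pattern(TLDS)

def extract(url):
    """Return label + TLD for a scheme-less URL, or None if no listed TLD ends the host."""
    host = url.strip().split("/", 1)[0]  # drop the path before matching
    match = PATTERN.search(host)
    return match.group(1) if match else None

for url in ["www.google.co.uk/something", "www.google.com/", "google.net/"]:
    print(extract(url))
```

Dropping the path first means a domain-like string inside the path is never considered, and anchoring at the end of the host avoids matching a TLD in the middle of a longer name.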

屌丝范 2024-12-28 12:05:52

Actually, there is no way to reliably parse a URI using a regex, for lots of reasons. For example, localhost, 192.168.0.43, and www.google.co.uk are all valid.

However, if you simply take the last dot-separated element, you don't want '43' from the IP address treated as a TLD, and there are many exceptions (co.uk and bl.uk behave differently).
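To illustrate the IP address case, here is a small sketch in Python (chosen only for brevity; looks_like_ip and naive_tld are names made up for this example) that checks the host with the standard ipaddress module before taking the last dot-separated element:

```python
import ipaddress

def looks_like_ip(host):
    """True if host parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False

def naive_tld(host):
    """Last dot-separated element, or None for IP addresses and dotless hosts."""
    if looks_like_ip(host) or "." not in host:
        return None
    return host.rsplit(".", 1)[1]

for host in ["localhost", "192.168.0.43", "www.google.co.uk"]:
    print(host, naive_tld(host))
```

The guard handles localhost and the IP address, but the naive split still yields uk rather than co.uk for www.google.co.uk, which is the kind of exception that needs a suffix list.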

I wrote a C library with Python bindings and a command-line tool, available here: http://www.github.com/stricaud/faup so you can do things like:

$ faup -p www.example.com
scheme,credential,subdomain,domain,host,tld,port,resource_path,query_string,fragment
,,www,example.com,www.example.com,com,,,,

To get the domains, you can put all the URLs in a file and run it through faup:

$ cat urls.txt |faup -f domain
google.co.uk
google.co.uk
google.com
google.com
google.com
google.com
google.com
google.com
google.com
google.net
google.net
google.net

If you just want the TLD, you can use the -f tld parameter, such as:

$ faup -f tld www.example.com
com

Or even get JSON output:

$ faup -o json http://www.test.co.uk/index.html?foo=bar#tagada
{
    "scheme": "http",
    "credential": "",
    "subdomain": "www",
    "domain": "test.co.uk",
    "host": "www.test.co.uk",
    "tld": "co.uk",
    "port": "",
    "resource_path": "/index.html",
    "query_string": "?foo=bar",
    "fragment": "#tagada"
}

Not only is this faster than a regex, it also deals with all the special cases you run into whenever you want to do something as seemingly simple as the domain/TLD extraction you want here.
