Remove subdomains from a string in Ruby

Posted on 2024-07-24 07:46:59

I'm looping over a series of URLs and want to clean them up. I have the following code:

# Parse url to remove http, path and check format
o_url = URI.parse(node.attributes['href'])

# Remove www
new_url = o_url.host.gsub('www.', '').strip

How can I extend this to remove the subdomains that exist in some URLs?

Comments (8)

浮光之海 2024-07-31 07:46:59

I just wrote a library to do this called Domainatrix. You can find it here: http://github.com/pauldix/domainatrix

require 'rubygems'
require 'domainatrix'

url = Domainatrix.parse("http://www.pauldix.net")
url.public_suffix # => "net"
url.domain        # => "pauldix"
url.canonical     # => "net.pauldix"

url = Domainatrix.parse("http://foo.bar.pauldix.co.uk/asdf.html?q=arg")
url.public_suffix # => "co.uk"
url.domain        # => "pauldix"
url.subdomain     # => "foo.bar"
url.path          # => "/asdf.html?q=arg"
url.canonical     # => "uk.co.pauldix.bar.foo/asdf.html?q=arg"

只为守护你 2024-07-31 07:46:59

For posterity, here's an update from Oct 2014:

I was looking for a more up-to-date dependency to rely on and found the public_suffix gem (available on RubyGems and GitHub). It is actively maintained and handles the top-level domain and nested-subdomain issues by maintaining a list of the known public suffixes.

In combination with URI.parse for stripping protocol and paths, it works really well:

require 'uri'
require 'public_suffix'

PublicSuffix.parse(URI.parse('https://subdomain.google.co.uk/path/on/path').host).domain
# => "google.co.uk"

等往事风中吹 2024-07-31 07:46:59

This is a tricky issue. Some top-level domains do not accept registrations at the second level.

Compare example.com and example.co.uk. If you simply stripped everything except the last two labels, you would end up with example.com and co.uk, and the latter is never what you want.

Firefox solves this by working with effective top-level domains, and a list of all of them is maintained for that purpose. More information at publicsuffix.org.

You can use this list to filter out everything except the label immediately to the left of the effective TLD. I don't know of any Ruby library that does this, but it would be a great idea to release one!

Update: there are C, Perl and PHP libraries that do this. Given the C version, you could create a Ruby extension. Alternatively, you could port the code to Ruby.
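
The lookup described above can be sketched in plain Ruby. This is only a minimal illustration, not a replacement for a maintained library: SAMPLE_SUFFIXES is a hand-picked, hypothetical subset of the list at publicsuffix.org, the method name registrable_domain is made up for this example, and the wildcard and exception rules of the real list are ignored.

```ruby
require 'uri'

# Illustrative subset only (an assumption for this sketch). The real list
# has thousands of entries plus wildcard and exception rules.
SAMPLE_SUFFIXES = %w[com net uk co.uk us oh.us k12.oh.us].freeze

# Returns the label immediately left of the longest matching suffix,
# joined with that suffix, or nil when nothing in the list matches.
def registrable_domain(host, suffixes = SAMPLE_SUFFIXES)
  host = host.downcase
  # A bare suffix such as "co.uk" has no registrable part.
  return nil if suffixes.include?(host)

  labels = host.split('.')
  # Walk from the longest candidate suffix to the shortest, always
  # leaving at least one label in front of it.
  (1...labels.size).each do |i|
    suffix = labels[i..-1].join('.')
    return labels[(i - 1)..-1].join('.') if suffixes.include?(suffix)
  end
  nil
end

host = URI.parse('http://foo.bar.pauldix.co.uk/asdf.html?q=arg').host
puts registrable_domain(host)                      # pauldix.co.uk
puts registrable_domain('mylocalschool.k12.oh.us') # mylocalschool.k12.oh.us
```

Trying the longest suffix first is what keeps co.uk from being mistaken for a registrable name under uk.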

小傻瓜 2024-07-31 07:46:59

The regular expression you'll need here can be a bit tricky, because hostnames can be arbitrarily complex: you could have multiple subdomains (e.g. foo.bar.baz.com), or the top-level domain (TLD) can have multiple parts (e.g. www.baz.co.uk).

Ready for a complex regular expression? :)

re = /^(?:(?>[a-z0-9-]*\.)+?|)([a-z0-9-]+\.(?>[a-z]*(?>\.[a-z]{2})?))$/i
new_url = o_url.host.gsub(re, '\1').strip

Let's break this into two sections. ^(?:(?>[a-z0-9-]*\.)+?|) collects the subdomains by matching one or more groups of characters, each followed by a dot. The +? quantifier is lazy rather than greedy, but because the rest of the pattern is anchored to the end of the string, it expands as far as needed for the remainder to match the second section. The empty alternation is needed when there is no subdomain (such as foo.com). ([a-z0-9-]+\.(?>[a-z]*(?>\.[a-z]{2})?))$ captures the actual hostname and the TLD. It allows either a one-part TLD (like .info, .com or .museum) or a two-part TLD whose second part is two characters long (like .oh.us or .org.uk).

I tested this expression on the following samples:

foo.com => foo.com
www.foo.com => foo.com
bar.foo.com => foo.com
www.foo.ca => foo.ca
www.foo.co.uk => foo.co.uk
a.b.c.d.e.foo.com => foo.com
a.b.c.d.e.foo.co.uk => foo.co.uk

Note that this regex will not properly match hostnames that have more than two "parts" to the TLD!
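
To make the behaviour easier to check, here is a small harness around the same pattern. The constant SUBDOMAIN_RE and the helper strip_subdomains are names invented for this sketch. The last call shows an edge case that is not in the sample list: a bare host on a two-part TLD, where the pattern treats the hostname itself as a subdomain.

```ruby
# The pattern from this answer, unchanged.
SUBDOMAIN_RE = /^(?:(?>[a-z0-9-]*\.)+?|)([a-z0-9-]+\.(?>[a-z]*(?>\.[a-z]{2})?))$/i

# Hypothetical helper: returns the host with leading subdomains removed.
def strip_subdomains(host)
  host.strip.sub(SUBDOMAIN_RE, '\1')
end

samples = %w[
  foo.com www.foo.com bar.foo.com www.foo.ca www.foo.co.uk
  a.b.c.d.e.foo.com a.b.c.d.e.foo.co.uk
]
samples.each { |host| puts "#{host} => #{strip_subdomains(host)}" }

# Edge case: with nothing in front of a two-part TLD, "foo." is consumed
# by the subdomain section and only the TLD is left.
puts strip_subdomains('foo.co.uk') # co.uk
```

If bare hosts such as foo.co.uk can appear in the input, a suffix-list based approach is the safer choice.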

嘿咻 2024-07-31 07:46:59

Something like:

def remove_subdomain(host)
  # Not complete: add every root domain you need to the regexp.
  host.sub(/.*?([^.]+(\.com|\.co\.uk|\.uk|\.nl))$/, "\\1")
end

puts remove_subdomain("www.example.com") # -> example.com
puts remove_subdomain("www.company.co.uk") # -> company.co.uk
puts remove_subdomain("www.sub.domain.nl") # -> domain.nl

You still need to add every suffix you consider a root domain. For example, '.uk' may itself be a root domain, but for hosts under '.co.uk' you probably want to keep the label just before the '.co.uk' part.
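
One way to keep that list manageable is to build the pattern from an array instead of editing the regexp by hand. The following is a sketch under assumptions: ROOT_DOMAINS is a deliberately incomplete, hypothetical list, and root_domain_for is a name invented for this example. Hosts whose suffix is not in the list are returned unchanged.

```ruby
# Hypothetical, incomplete list; extend it to cover the suffixes in your data.
ROOT_DOMAINS = %w[com co.uk uk nl].freeze

# Regexp.union escapes each string, so the dots match literally.
ROOT_PATTERN = /([^.]+\.(?:#{Regexp.union(ROOT_DOMAINS).source}))\z/

# Returns the label before a known root domain plus that root domain,
# or the host itself when no listed root domain matches.
def root_domain_for(host)
  host[ROOT_PATTERN, 1] || host
end

puts root_domain_for('www.example.com')   # example.com
puts root_domain_for('www.company.co.uk') # company.co.uk
puts root_domain_for('www.sub.domain.nl') # domain.nl
```

Because the regexp engine returns the leftmost match, company.co.uk is found before co.uk could match against the shorter uk entry.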

魔法唧唧 2024-07-31 07:46:59

Detecting the subdomain of a URL is non-trivial to do in a general sense - it's easy if you just consider the basic ones, but once you get into international territory this becomes tricky.

Edit: Consider stuff like http://mylocalschool.k12.oh.us et al.

以可爱出名 2024-07-31 07:46:59

Why not just strip the .com or .co.uk and then split on '.' and get the last element?

some_url.host.sub(/(\.co\.uk|\.[^.]*)$/, '').split('.')[-1] + $1

Have to say it feels hacky. Are there any other domains like .co.uk?

浅语花开 2024-07-31 07:46:59

I've wrestled with this a lot while writing various crawlers and scrapers over the years. My favorite gem for solving this is FuzzyUrl by Pete Gamache: https://github.com/gamache/fuzzyurl . It's available for Ruby, JavaScript and Elixir.
