Removing subdomains from a string in Ruby

Posted on 2024-07-24 07:46:59 · 248 characters · 11 views · 0 comments

I'm looping over a series of URLs and want to clean them up. I have the following code:

# Parse url to remove http, path and check format
o_url = URI.parse(node.attributes['href'])

# Remove www
new_url = o_url.host.gsub('www.', '').strip

How can I extend this to remove the subdomains that exist in some URLs?

浮光之海 2024-07-31 07:46:59

I just wrote a library to do this called Domainatrix. You can find it here: http://github.com/pauldix/domainatrix

require 'rubygems'
require 'domainatrix'

url = Domainatrix.parse("http://www.pauldix.net")
url.public_suffix # => "net"
url.domain        # => "pauldix"
url.canonical     # => "net.pauldix"

url = Domainatrix.parse("http://foo.bar.pauldix.co.uk/asdf.html?q=arg")
url.public_suffix # => "co.uk"
url.domain        # => "pauldix"
url.subdomain     # => "foo.bar"
url.path          # => "/asdf.html?q=arg"
url.canonical     # => "uk.co.pauldix.bar.foo/asdf.html?q=arg"

只为守护你 2024-07-31 07:46:59

For posterity, here's an update from Oct 2014:

I was looking for a more up-to-date dependency to rely on and found the public_suffix gem (RubyGems) (GitHub). It's being actively maintained and handles all the top-level domain and nested-subdomain issues by maintaining a list of the known public suffixes.

In combination with URI.parse for stripping protocol and paths, it works really well:

require 'uri'
require 'public_suffix'
PublicSuffix.parse(URI.parse('https://subdomain.google.co.uk/path/on/path').host).domain
# => "google.co.uk"

等往事风中吹 2024-07-31 07:46:59

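This is a tricky problem. Some top-level domains do not accept registrations at the second level.

Compare example.com and example.co.uk. If you simply strip everything except the last two labels, you end up with example.com and co.uk, which can never be the intent.

Firefox solves this by filtering by effective top-level domain, and they maintain a list of all of these domains. More information is at publicsuffix.org.

You can use this list to filter out everything except the label immediately to the left of the effective TLD. I don't know of any Ruby library that does this, but it would be a great idea to release one!

Update: there are C, Perl and PHP libraries that do this. Given the C version, you could create a Ruby extension. Alternatively, you could port the code to Ruby.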
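
To make the list-based idea concrete, here is a minimal sketch. The SAMPLE_SUFFIXES array and the registrable_domain helper are hypothetical names invented for illustration; the array is a tiny hand-picked stand-in for the real list at publicsuffix.org, and the sketch ignores that list's wildcard and exception rules.

```ruby
# Hypothetical, heavily truncated stand-in for the list at publicsuffix.org.
# A real implementation would load the full list and honour its wildcard
# and exception rules, which this sketch ignores.
SAMPLE_SUFFIXES = %w[com net uk co.uk nl us oh.us k12.oh.us].freeze

# Returns the longest matching suffix plus the one label directly to its
# left. Returns nil when nothing matches or when the host is itself a suffix.
def registrable_domain(host, suffixes = SAMPLE_SUFFIXES)
  labels = host.downcase.split('.')
  # Walk from the longest candidate suffix to the shortest,
  # so "co.uk" is found before "uk".
  labels.each_index do |start|
    next unless suffixes.include?(labels[start..-1].join('.'))
    return start.zero? ? nil : labels[(start - 1)..-1].join('.')
  end
  nil
end

puts registrable_domain('www.example.com')         # example.com
puts registrable_domain('foo.bar.example.co.uk')   # example.co.uk
puts registrable_domain('mylocalschool.k12.oh.us') # mylocalschool.k12.oh.us
```

Checking the longest candidate first is what keeps example.co.uk from collapsing to co.uk.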

小傻瓜 2024-07-31 07:46:59

The regular expression you'll need here can be a bit tricky, because hostnames can be arbitrarily complex: you could have multiple subdomains (e.g. foo.bar.baz.com), or the top-level domain (TLD) can have multiple parts (e.g. www.baz.co.uk).

Ready for a complex regular expression? :)

re = /^(?:(?>[a-z0-9-]*\.)+?|)([a-z0-9-]+\.(?>[a-z]*(?>\.[a-z]{2})?))$/i
new_url = o_url.host.gsub(re, '\1').strip

Let's break this into two sections. ^(?:(?>[a-z0-9-]*\.)+?|) collects the subdomains by matching one or more groups of characters followed by a dot. The +? quantifier is lazy, so it consumes only as many leading labels as the rest of the pattern needs in order to match. The empty alternation is needed when there is no subdomain (such as foo.com). ([a-z0-9-]+\.(?>[a-z]*(?>\.[a-z]{2})?))$ collects the actual hostname and the TLD. It allows either a one-part TLD (like .info, .com or .museum) or a two-part TLD whose second part is two characters (like .oh.us or .org.uk).

I tested this expression on the following samples:

foo.com => foo.com
www.foo.com => foo.com
bar.foo.com => foo.com
www.foo.ca => foo.ca
www.foo.co.uk => foo.co.uk
a.b.c.d.e.foo.com => foo.com
a.b.c.d.e.foo.co.uk => foo.co.uk

Note that this regex will not properly match hostnames that have more than two "parts" to the TLD!
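
As a quick check, the sketch below runs the pattern over the samples listed above. The strip_with_regex helper name is hypothetical, and the last two hosts are extra inputs that are not part of the original answer; they show where a purely structural pattern guesses wrong.

```ruby
# The pattern from the answer above, unchanged.
RE = /^(?:(?>[a-z0-9-]*\.)+?|)([a-z0-9-]+\.(?>[a-z]*(?>\.[a-z]{2})?))$/i

# Hypothetical helper name; it wraps the same gsub call as the answer.
def strip_with_regex(host)
  host.gsub(RE, '\1').strip
end

# Samples listed in the answer.
%w[foo.com www.foo.com bar.foo.com www.foo.ca www.foo.co.uk
   a.b.c.d.e.foo.com a.b.c.d.e.foo.co.uk].each do |host|
  puts "#{host} => #{strip_with_regex(host)}"
end

# Two extra inputs, not from the answer, that the pattern gets wrong.
puts strip_with_regex('foo.co.uk')     # co.uk
puts strip_with_regex('x.y.google.ca') # y.google.ca
```

Both misses come from the pattern treating any two-letter final label as a possible second part of a two-part TLD, which is why a maintained suffix list tends to be more reliable than a regex alone.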

嘿咻 2024-07-31 07:46:59

Something like:

def remove_subdomain(host)
  # Not complete: add every root domain you need to the regexp.
  host.sub(/.*?([^.]+(\.com|\.co\.uk|\.uk|\.nl))$/, "\\1")
end

puts remove_subdomain("www.example.com") # -> example.com
puts remove_subdomain("www.company.co.uk") # -> company.co.uk
puts remove_subdomain("www.sub.domain.nl") # -> domain.nl

You still need to add every suffix you consider a root domain. So '.uk' might be a root domain, but you probably want to keep the label just before the '.co.uk' part.
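
One way to keep that list maintainable is to build the pattern from an array instead of editing the regexp by hand. The following is only a sketch: ROOT_SUFFIXES, ROOT_DOMAIN_RE and remove_subdomain_from_list are hypothetical names, and the four suffixes are simply the ones used above.

```ruby
# Hypothetical suffix list; extend it with every suffix you treat as a root.
ROOT_SUFFIXES = %w[com co.uk uk nl].freeze

# Regexp.union escapes the dots in each string for us.
# The pattern captures one label plus a listed suffix at the end of the host.
ROOT_DOMAIN_RE = /([^.]+\.(?:#{Regexp.union(ROOT_SUFFIXES)}))\z/

# Returns the host unchanged when no listed suffix matches.
def remove_subdomain_from_list(host)
  # String#[] returns the leftmost match, so for "www.company.co.uk" the
  # match starting at "company" is found before the one starting at "co".
  host[ROOT_DOMAIN_RE, 1] || host
end

puts remove_subdomain_from_list('www.example.com')   # example.com
puts remove_subdomain_from_list('www.company.co.uk') # company.co.uk
puts remove_subdomain_from_list('www.sub.domain.nl') # domain.nl
```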

魔法唧唧 2024-07-31 07:46:59

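Detecting the subdomain of a URL is non-trivial in the general case. It's easy if you only consider the basic ones, but once you get into international territory it becomes tricky.

Edit: Consider things like http://mylocalschool.k12.oh.us et al.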

以可爱出名 2024-07-31 07:46:59

Why not just strip the .com or .co.uk and then split on '.' and get the last element?

some_url.host.sub(/(\.co\.uk|\.[^.]*)$/, '').split('.')[-1] + $1

Have to say it feels hacky. Are there any other domains like .co.uk?

浅语花开 2024-07-31 07:46:59

I've struggled with this problem for years while writing various crawlers and scrapers. My favorite gem for solving it is Pete Gamache's FuzzyUrl: https://github.com/gamache/fuzzyurl. It's available for Ruby, JavaScript and Elixir.

About the author: 无力看清
