Remove subdomain from a string in Ruby
I'm looping over a series of URLs and want to clean them up. I have the following code:
# Parse url to remove http, path and check format
o_url = URI.parse(node.attributes['href'])
# Remove www
new_url = o_url.host.gsub('www.', '').strip
How can I extend this to remove the subdomains that exist in some URLs?
8 Answers
I just wrote a library to do this called Domainatrix. You can find it here: http://github.com/pauldix/domainatrix
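A rough usage sketch, assuming the gem exposes the parse/subdomain/domain/public_suffix readers shown in its README (treat the exact method names as assumptions and check the version you install):

require 'domainatrix'

url = Domainatrix.parse("http://foo.bar.pauldix.co.uk/some/path")
url.subdomain                          # assumed reader, e.g. "foo.bar"
url.domain                             # assumed reader, e.g. "pauldix"
url.public_suffix                      # assumed reader, e.g. "co.uk"
"#{url.domain}.#{url.public_suffix}"   # host with subdomains stripped, e.g. "pauldix.co.uk"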
For posterity, here's an update from Oct 2014:
I was looking for a more up-to-date dependency to rely on and found the public_suffix gem (on RubyGems and GitHub). It's actively maintained and handles all the top-level-domain and nested-subdomain issues by maintaining a list of the known public suffixes.
In combination with URI.parse for stripping protocol and paths, it works really well:
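A minimal sketch of that combination, assuming PublicSuffix.domain as the lookup helper:

require 'uri'
require 'public_suffix'

def registrable_domain(href)
  host = URI.parse(href).host   # drop scheme, path and query
  PublicSuffix.domain(host)     # collapse subdomains using the public suffix list
end

registrable_domain('http://www.foo.example.co.uk/some/page')  # => "example.co.uk"
registrable_domain('https://blog.example.com/')               # => "example.com"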
This is a tricky issue. Some top-level domains do not accept registrations at the second level.
Compare example.com and example.co.uk. If you simply stripped everything except the last two labels, you would end up with example.com and co.uk, which can never be the intention.
Firefox solves this by filtering by effective top-level domain, and they maintain a list of all these domains. More information is available at publicsuffix.org.
You can use this list to filter out everything except the domain right next to the effective TLD. I don't know of any Ruby library that does this, but it would be a great idea to release one!
Update: there are C, Perl and PHP libraries that do this. Given the C version, you could create a Ruby extension. Alternatively, you could port the code to Ruby.
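As an illustration of that filtering approach, here is a sketch that uses a tiny hand-picked subset of the list; the real list at publicsuffix.org has thousands of entries, so this is illustrative only:

SUFFIXES = %w[com net org uk co.uk org.uk us oh.us k12.oh.us].freeze

def registered_domain(host)
  labels = host.downcase.split('.')
  # Try the longest candidate suffix first, then keep one extra label to its left.
  (1...labels.length).each do |i|
    suffix = labels[i..-1].join('.')
    return labels[(i - 1)..-1].join('.') if SUFFIXES.include?(suffix)
  end
  host
end

registered_domain('www.example.co.uk')        # => "example.co.uk"
registered_domain('foo.bar.example.com')      # => "example.com"
registered_domain('mylocalschool.k12.oh.us')  # => "mylocalschool.k12.oh.us"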
The regular expression you'll need here can be a bit tricky, because hostnames can be infinitely complex: you could have multiple subdomains (e.g. foo.bar.baz.com), or the top-level domain (TLD) can have multiple parts (e.g. www.baz.co.uk).
Ready for a complex regular expression? :)
Let's break this into two sections.
^(?:(?>[a-z0-9-]*\.)+?|)
will collect the subdomains by matching one or more groups of characters followed by a dot, so that all of the subdomains end up matched here. The empty alternation is needed for the case of no subdomain at all (such as foo.com).
([a-z0-9-]+\.(?>[a-z]*(?>\.[a-z]{2})?))$
will collect the actual hostname and the TLD. It allows either a one-part TLD (like .info, .com or .museum) or a two-part TLD whose second part is two characters (like .oh.us or .org.uk). I tested this expression against a handful of sample hostnames.
Note that this regex will not properly match hostnames that have more than two "parts" to the TLD!
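For reference, here is a quick sketch of the two halves joined into a single Ruby regexp (Ruby's engine supports the atomic (?>...) groups used above); capture group 1 holds the hostname plus TLD:

HOST_RE = /^(?:(?>[a-z0-9-]*\.)+?|)([a-z0-9-]+\.(?>[a-z]*(?>\.[a-z]{2})?))$/

%w[foo.com www.foo.com foo.bar.baz.com www.baz.co.uk].each do |host|
  m = HOST_RE.match(host)
  puts "#{host} -> #{m[1]}" if m
end
# foo.com -> foo.com
# www.foo.com -> foo.com
# foo.bar.baz.com -> baz.com
# www.baz.co.uk -> baz.co.uk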
Something like the sketch below. You still need to add all the (root) domains you consider a root domain; '.uk' might be the root domain, but you probably want to keep the host part just before '.co.uk'.
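A minimal sketch of that idea, assuming a small hand-maintained set of compound roots (illustrative, not complete):

# Keep two labels by default, three when the last two form a known compound root.
COMPOUND_ROOTS = %w[co.uk org.uk gov.uk com.au co.jp].freeze

def strip_subdomains(host)
  labels = host.split('.')
  keep = COMPOUND_ROOTS.include?(labels.last(2).join('.')) ? 3 : 2
  labels.last(keep).join('.')
end

strip_subdomains('www.example.co.uk')   # => "example.co.uk"
strip_subdomains('foo.bar.example.com') # => "example.com"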
Detecting the subdomain of a URL is non-trivial to do in a general sense - it's easy if you just consider the basic ones, but once you get into international territory this becomes tricky.
Edit: Consider stuff like http://mylocalschool.k12.oh.us et al.
Why not just strip the .com or .co.uk and then split on '.' and get the last element?
Have to say it feels hacky. Are there any other domains like .co.uk?
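For what it's worth, the hacky version might look roughly like this; the ever-growing suffix list is exactly why the public-suffix-based answers above scale better:

host = 'www.example.co.uk'
suffix = %w[.co.uk .org.uk .com .net .org].find { |s| host.end_with?(s) }
if suffix
  bare = host.delete_suffix(suffix)   # "www.example"
  puts bare.split('.').last + suffix  # "example.co.uk"
end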
I've wrestled with this a lot in writing various and sundry crawlers and scrapers over the years. My favorite gem for solving this is FuzzyUrl by Pete Gamache: https://github.com/gamache/fuzzyurl. It's available for Ruby, JavaScript and Elixir.