Removing duplicate domains from a list using a regular expression

Posted on 2024-08-22 01:39:53

I'd like to use PCRE to take a list of URIs and distill it down to one URI per domain, keeping the first one listed.

Start:

http://abcd.tld/products/widget1
http://abcd.tld/products/widget2
http://abcd.tld/products/review
http://1234.tld/

Finish:

http://abcd.tld/products/widget1
http://1234.tld/

Any ideas, dear members of StackOverflow?

且行且努力 2024-08-29 01:39:53

You can use simple tools like uniq.

See kobi's example in the comments:

grep -o "^[^/]*//[^/]*/" urls.txt | sort | uniq
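Note that grep -o prints only the matched scheme-and-host prefix, not the first full URL for each host. For readers who want the full URL back, here is a minimal Ruby sketch of the same idea. It is not kobi's code: the helper name first_per_host and the sample list are assumptions made for illustration, and the pattern is simply the one from the grep command anchored at the start of the string.

```ruby
# Hypothetical helper (not from the thread): keep the first URL seen for
# each "scheme://host/" prefix, reusing the pattern from the grep command.
def first_per_host(urls)
  cleaned = urls.map(&:strip).reject(&:empty?)
  # Array#uniq with a block keeps the first element for each block value.
  # A URL with no slash after the host does not match, so it is keyed on itself.
  cleaned.uniq { |url| url[%r{\A[^/]*//[^/]*/}] || url }
end

# Sample input from the question, including its trailing spaces.
urls = [
  "http://abcd.tld/products/widget1       ",
  "http://abcd.tld/products/widget2    ",
  "http://abcd.tld/products/review    ",
  "http://1234.tld/"
]

puts first_per_host(urls)
```

Unlike sort | uniq, this keeps the input order, which is what the question's expected result relies on.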

⒈起吃苦の倖褔 2024-08-29 01:39:53

While it's INSANELY inefficient, it can be done...

(?<!^http://\2/.*?$.*)^(http://(.*?)/.*?$)

Please don't use this

悲凉≈ 2024-08-29 01:39:53

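Parse out the domain using a URI library, then use it as a key in a hash. If you store a link only when its host is not in the hash yet, the first link seen for each domain is the one that stays, so you end up with unique links.

Here's a Ruby example:

require 'uri'

# Sample input from the question.
links = [
  'http://abcd.tld/products/widget1',
  'http://abcd.tld/products/widget2',
  'http://abcd.tld/products/review',
  'http://1234.tld/'
]

unique_links = {}

links.each do |l|
  u = URI.parse(l.strip)
  unique_links[u.host] ||= l.strip # keep the first link seen for each host
end

unique_links.values # returns an Array of the unique links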

向日葵 2024-08-29 01:39:53

If you can work with the whole file as a single string, rather than line by line, then something like this should work, as long as URLs that share a prefix sit on adjacent lines. (I'm not sure about the character ranges.)

s!(\w+://[a-zA-Z0-9.]+/\S+/)([^ /]+)\n(\1[^ /]+\n)+!\1\2\n!g
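The whole-string idea can also be sketched in Ruby. This is a rough illustration under stated assumptions, not the answerer's code: the helper name collapse_adjacent_hosts is made up, the backreference is keyed on scheme and host only (rather than on the path prefix used above), and it assumes URLs for the same host sit on adjacent lines and every line ends with a newline.

```ruby
# Hypothetical helper (not from the thread): collapse each run of adjacent
# lines that share "scheme://host/" down to the first line of the run.
def collapse_adjacent_hosts(text)
  # Strip trailing spaces and tabs on every line so they cannot block a match.
  cleaned = text.gsub(/[ \t]+$/, "")
  # Group 1 is "scheme://host/", group 2 is the rest of the first line.
  # The non-capturing group swallows the following lines that start with
  # the same group 1; they are dropped from the replacement.
  cleaned.gsub(%r{^(\w+://[a-zA-Z0-9.]+/)(\S*)\n(?:\1\S*\n)+}) { "#{$1}#{$2}\n" }
end

# Sample input from the question, including its trailing spaces.
text = [
  "http://abcd.tld/products/widget1       ",
  "http://abcd.tld/products/widget2    ",
  "http://abcd.tld/products/review    ",
  "http://1234.tld/"
].join("\n") + "\n"

print collapse_adjacent_hosts(text)
```

If the same host can reappear later in the file, the input would need sorting first, or a hash-based approach is the safer choice.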

看春风乍起 2024-08-29 01:39:53

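If you have (g)awk on your system (the order of the printed lines is not guaranteed, because awk's for-in loop visits array keys in an unspecified order):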

awk -F"/" '{
 s=$1
 for(i=2;i<NF;i++){ s=s"/"$i }
 if( !(s in a) ){ a[s]=$NF }
}
END{
    for(i in a) print i"/"a[i]
} ' file

output

$ ./shell.sh
http://abcd.tld/products/widget1
http://1234.tld/
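For comparison, the awk logic can be mirrored in Ruby. This is a minimal sketch under assumptions: the helper name first_per_prefix is hypothetical, and like the awk script it keys on everything before the last slash rather than on the host alone. Ruby hashes keep insertion order, so the output follows the input order.

```ruby
# Hypothetical helper (not from the thread): keep the first line seen for
# each prefix, where the prefix is everything before the last "/".
def first_per_prefix(lines)
  seen = {}
  lines.each do |line|
    # rpartition splits at the last "/": [prefix, "/", last_segment]
    prefix, _slash, last = line.strip.rpartition("/")
    seen[prefix] = last unless seen.key?(prefix)
  end
  seen.map { |prefix, last| "#{prefix}/#{last}" }
end

# Sample input from the question, including its trailing spaces.
lines = [
  "http://abcd.tld/products/widget1       ",
  "http://abcd.tld/products/widget2    ",
  "http://abcd.tld/products/review    ",
  "http://1234.tld/"
]

puts first_per_prefix(lines)
```

Because the key is the path prefix, a line such as http://abcd.tld/about/team would be kept alongside http://abcd.tld/products/widget1. Keying on the host instead gives strict one-per-domain output.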
