Extracting URLs with their types from a string
I'm trying to extract URLs from a string. They aren't standardized: some are inside the href attribute of anchor tags, others appear on their own.
I also need them grouped by type. For example, given the following strings:
val txt1: String = """Some text! <a href="http://www.google.com/test.mp3">MP3</a>"""
val txt2: String = """Some text! <a href="http://www.google.com/test.jpg">IMG</a>"""
val txt3: String = """Some more! <a href="http://www.google.com/">Link!</a>"""
These strings are all concatenated and contain three URLs in total. I'm looking for something along the lines of:
val result: List[(String, List[String])] = List(
"mp3" -> List("http://www.google.com/test.mp3"),
"img" -> List("http://www.google.com/test.jpg"),
"url" -> List("http://www.google.com/")
)
I've looked into regex but have only got as far as extracting hrefs without determining their types, and this doesn't pick up bare URLs outside of tags:
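import scala.util.matching.Regex
val hrefRegex = new Regex("""\<a.*?href=\"(http:.*?)\".*?\>.*?\</a>""")
// group(1) is the captured href value; findAllIn alone would return the whole <a>...</a> match
val hrefs: List[String] = hrefRegex.findAllMatchIn(txt1).map(_.group(1)).toList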
Any help is much appreciated, thanks in advance :)
Comments (1)
Assuming val txt = txt1 + txt2 + txt3, you can wrap the text in an XML element as a string, parse it as XML, and use the standard XML library to extract the anchors. Then you just need to post-process until the data is organized the way you want.
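A minimal sketch of that approach might look like the following. It assumes the scala-xml module is on the classpath (it is a separate dependency in recent Scala versions) and that the text is well-formed enough to parse as XML. The object name UrlsByType, the extension-to-label table, and the bare-URL regex are illustrative assumptions, not part of the question or the answer.

```scala
import java.net.URI
import scala.util.Try
import scala.xml.XML // assumption: the scala-xml module is available

object UrlsByType {
  // Hypothetical mapping from file extension to the labels used in the question.
  private val typeByExtension: Map[String, String] =
    Map("mp3" -> "mp3", "jpg" -> "img", "jpeg" -> "img", "png" -> "img", "gif" -> "img")

  // Rough pattern for URLs that appear as plain text outside of tags (an assumption).
  private val bareUrl = """https?://[^\s<>"]+""".r

  // Label a URL by the extension of its path; anything unrecognised is "url".
  def classify(url: String): String = {
    val path = Try(new URI(url).getPath).toOption.flatMap(Option(_)).getOrElse("")
    val ext = path.substring(path.lastIndexOf('.') + 1).toLowerCase
    typeByExtension.getOrElse(ext, "url")
  }

  def extract(txt: String): Map[String, List[String]] = {
    // Wrap the fragment in a single root element so it parses as one XML document.
    val root = XML.loadString(s"<root>$txt</root>")
    // href attribute values of every <a> element, at any depth.
    val hrefs = (root \\ "a").map(a => (a \ "@href").text).filter(_.nonEmpty).toList
    // root.text holds only text content, so href values are not matched a second time.
    val bare = bareUrl.findAllIn(root.text).toList
    (hrefs ++ bare).groupBy(classify)
  }
}
```

With the three strings from the question, UrlsByType.extract(txt1 + txt2 + txt3) groups the URLs under the keys "mp3", "img" and "url". XML.loadString throws on input that is not well-formed (for example an unescaped & or an unclosed tag), so real-world HTML would likely need a more lenient parser, and the bare-URL pattern will also swallow trailing punctuation.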