Extract URLs from a string, grouped by type

Posted on 2024-12-10 01:09:02

I'm trying to extract URLs from a string. They aren't standardized: some are inside href attributes, others appear on their own.

I also need them sorted by type. For example, given the following strings:

// triple-quoted strings, so the inner double quotes need no escaping
val txt1: String = """Some text! <a href="http://www.google.com/test.mp3">MP3</a>"""
val txt2: String = """Some text! <a href="http://www.google.com/test.jpg">IMG</a>"""
val txt3: String = """Some more! <a href="http://www.google.com/">Link!</a>"""

These strings are concatenated and contain three URLs. I'm looking for something along the lines of:

val result: Map[String, List[String]] = Map(
    "mp3" -> List("http://www.google.com/test.mp3"),
    "img" -> List("http://www.google.com/test.jpg"),
    "url" -> List("http://www.google.com/")
)

I've looked into regex, but have only got as far as extracting hrefs without determining their types, and this also doesn't pick up URLs that appear on their own outside of tags:

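import scala.util.matching.Regex
val hrefRegex = new Regex("""<a.*?href="(http:.*?)".*?>.*?</a>""")
// findAllIn would return the whole <a>...</a> match; group(1) is the href value
val hrefs: List[String] = hrefRegex.findAllMatchIn(txt1).map(_.group(1)).toList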

Any help is much appreciated, thanks in advance :)

Comments (1)

眉目亦如画i 2024-12-17 01:09:02

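Assuming val txt = txt1 + txt2 + txt3, you can wrap the text in a root element, parse the result as XML, and use the Scala XML library (scala.xml, shipped as the separate scala-xml module since Scala 2.11) to extract the anchors. Note that this only works if the text is well-formed XML.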

// can do other cleanup if necessary here such as changing "link!"
def normalize(t: String) = t.toLowerCase()

val txtAsXML = xml.XML.loadString("<root>" + txt + "</root>")
val anchors = txtAsXML \\ "a"
// returns scala.xml.NodeSeq containing the <a> tags

Then you just need to post-process until the data is organized the way you want:

// (a \ "@href").text yields "" instead of failing when an anchor has no href
val tuples = anchors.map(a => normalize(a.text) -> (a \ "@href").text)
// Seq[(String, String)] containing pairs
// like "mp3" -> http://www.google.com/test.mp3

// map instead of mapValues, which is deprecated and returns a lazy view since Scala 2.13
val byTypes = tuples.groupBy(_._1).map { case (kind, pairs) => kind -> pairs.map(_._2) }
// here grouped by types:
// Map(img -> List(http://www.google.com/test.jpg),
//     link! -> List(http://www.google.com/),
//     mp3 -> List(http://www.google.com/test.mp3))
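
The XML approach above groups by the anchor text and does not see URLs that sit outside of tags. If the type should instead come from the file extension, and bare URLs should be picked up too, a regex-only variant might look roughly like the sketch below. It uses only the standard library. The object name UrlsByType, the extension table, and the "url" fallback are assumptions made for illustration, not something from the question, and the pattern is deliberately simple: it does not strip trailing punctuation after a bare URL.

```scala
import scala.util.matching.Regex

object UrlsByType {
  // Matches http/https URLs up to the next whitespace, quote, or angle bracket,
  // so it finds both href values and bare URLs in the text.
  private val UrlPattern: Regex = """https?://[^\s"'<>]+""".r

  // Hypothetical extension-to-type table; extend it as needed.
  private val TypeByExtension: Map[String, String] = Map(
    "mp3" -> "mp3",
    "jpg" -> "img", "jpeg" -> "img", "png" -> "img", "gif" -> "img"
  )

  // Classify a URL by the extension of its path; anything else is a plain "url".
  def classify(url: String): String = {
    val afterScheme = url.substring(url.indexOf("://") + 3)
    val slash = afterScheme.indexOf('/')
    // Path without query string or fragment; empty when the URL is host-only.
    val path =
      if (slash >= 0) afterScheme.substring(slash).takeWhile(c => c != '?' && c != '#')
      else ""
    val dot = path.lastIndexOf('.')
    // Only count a dot that appears in the last path segment.
    val ext = if (dot > path.lastIndexOf('/')) path.substring(dot + 1).toLowerCase else ""
    TypeByExtension.getOrElse(ext, "url")
  }

  // Find every URL once (in order of first appearance) and group by type.
  def extract(text: String): Map[String, List[String]] =
    UrlPattern.findAllIn(text).toList.distinct.groupBy(classify)
}
```

Calling UrlsByType.extract(txt) on the concatenated text returns a Map[String, List[String]] keyed by "mp3", "img", and "url". The distinct call is there because a URL used both as the href value and as the link text would otherwise be listed twice.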