What is the most elegant way to find word pairs in a text with Scala?

Posted on 2024-11-24 02:57:26

Given a list of word pairs

val terms = ("word1a", "word1b") :: ("word2a", "word2b") :: ... :: Nil

What's the most elegant way in Scala to test whether at least one of the pairs occurs in a text? The test should terminate as quickly as possible when it hits the first match. How would you solve that?

EDIT: To be more precise, I want to know whether both words of a pair appear somewhere in the text (not necessarily in order). If that is the case for any pair in the list, the method should return true. The matched pair need not be returned, nor does it matter whether more than one pair matches.

Comments (3)

一腔孤↑勇 2024-12-01 02:57:26
scala> val text = Set("blah1", "word2b", "blah2", "word2a")
text: scala.collection.immutable.Set[java.lang.String] = Set(blah1, word2b, blah2, word2a)

scala> terms.exists{case (a,b) => text(a) && text(b)}
res12: Boolean = true

EDIT: Note that representing the tokens of the text as a Set makes each membership lookup much more efficient than scanning a sequential collection such as a List, which you would not want to use here.

EDIT 2: Updated to reflect the clarified requirement.

EDIT 3: Changed contains to apply (text(a) instead of text.contains(a)), as suggested in a comment.
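
For completeness, here is a minimal sketch of how the token set might be built from a raw string before running the same exists test. The helper name containsAnyPair, the sample terms, and the whitespace-based tokenization are assumptions for illustration, not part of the answer above; punctuation is not handled.

```scala
// Hypothetical helper: true if both words of at least one pair occur in rawText.
// Assumes words are separated by whitespace; punctuation is not stripped.
def containsAnyPair(rawText: String, terms: List[(String, String)]): Boolean = {
  // Build the token set once, so each later lookup is a cheap membership test.
  val tokens: Set[String] = rawText.split("\\s+").toSet
  // exists stops at the first pair whose two words are both present.
  terms.exists { case (a, b) => tokens(a) && tokens(b) }
}

// Sample data, invented for this sketch.
val sampleTerms = List(("word1a", "word1b"), ("word2a", "word2b"))
```

Note that this reads the whole text up front; the early exit applies only to the scan over the pairs.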

无声静候 2024-12-01 02:57:26

EDIT: It seems the ambiguous wording of your question led me to answer a different question: whether either word of any pair appears in the text.

Because that amounts to asking for any single word from the pairs, you might as well flatten all the pairs into one big set:

val words = terms.foldLeft(Set.empty[String]) { case (s, (w1, w2)) => s + w1 + w2 }

Then you are just asking whether any of these words exists in the text:

text.split("\\s") exists words

This is fast because a Set gives a quick lookup of whether each word of the text is one of the search words, and exists terminates at the first match:

scala> val text = "blah1  blah2 word2b"
text: java.lang.String = blah1  blah2 word2b

If your text is very long, you may wish to Stream it, so that the next word to test is computed lazily, rather than splitting the String into substrings up front. The pattern requires at least one non-whitespace character, so the stream ends once only whitespace remains:

scala> val Word = """(?s)\s*(\S.*)""".r
Word: scala.util.matching.Regex = (?s)\s*(\S.*)

scala> def strmWds(text : String) : Stream[String] = text match {
     | case Word(nxt) => val (word, rest) = nxt span (!_.isWhitespace); word #:: strmWds(rest)
     | case _         => Stream.empty
     | }
strmWds: (text: String)Stream[String]

Now you can:

scala> strmWds(text) exists words
res4: Boolean = true

scala> text.split("\\s") exists words
res3: Boolean = true
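
As an alternative sketch of the same lazy-tokenization idea, Regex.findAllIn returns an iterator that finds matches on demand, so no hand-written stream is needed. The names wordIterator and containsAnyWord are hypothetical, and, like this answer, the check is for any single word from the pairs rather than both words of a pair.

```scala
// Lazily yields the whitespace-separated words of text, one match at a time.
def wordIterator(text: String): Iterator[String] =
  """\S+""".r.findAllIn(text)

// True as soon as any word of the text is one of the words in terms.
def containsAnyWord(text: String, terms: List[(String, String)]): Boolean = {
  // Flatten the pairs into a single set of search words.
  val words: Set[String] = terms.flatMap { case (w1, w2) => List(w1, w2) }.toSet
  // A Set[String] is also a String => Boolean, so it can be passed to exists directly.
  wordIterator(text).exists(words)
}
```

Because the iterator only advances as far as exists needs, the rest of a long text is never scanned once a match is found.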
巷雨优美回忆 2024-12-01 02:57:26

I'm assuming that both elements of the pair have to appear in the text, but it doesn't matter where, and it doesn't matter which pair appears.

I'm not sure this is the most elegant, but it's not bad. It is fairly fast if you expect that the text probably contains the words (so you don't need to read all of it) and if you can generate an iterator that gives you the words one at a time:

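case class WordPair(one: String, two: String) {
  // Mutable flags recording which words of the pair have been seen so far.
  private[this] var found_one, found_two = false
  // Feed one word of the text; returns true once both words have been seen.
  def check(s: String): Boolean = {
    if (s == one) found_one = true
    if (s == two) found_two = true
    found_one && found_two
  }
  def reset(): Unit = {
    found_one = false
    found_two = false
  }
}

val wordpairlist = terms.map { case (w1, w2) => WordPair(w1, w2) }

// May need to wordpairlist.foreach(_.reset()) first, if you do this on multiple texts.
// Assumes text is a collection of words (e.g. the split string), not the raw String.
text.iterator.exists(w => wordpairlist.exists(_.check(w)))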

You could further improve things by putting all the terms in a set, and not even bothering to check the wordpairlist unless the word from the text was in that set.
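
That improvement might look roughly like the sketch below. It is an assumption-laden variant rather than the code above: the name anyPairSeen is hypothetical, and it tracks the seen words in one mutable set instead of per-pair flags, consulting the pair list only when a not-yet-seen term turns up.

```scala
import scala.collection.mutable

// Hypothetical helper: true once both words of some pair have been seen.
def anyPairSeen(words: Iterator[String], terms: List[(String, String)]): Boolean = {
  // Every word that occurs in any pair.
  val interesting: Set[String] = terms.flatMap { case (a, b) => List(a, b) }.toSet
  // Interesting words encountered so far in the text.
  val seen = mutable.Set.empty[String]
  words.exists { w =>
    // seen.add returns true only for a newly added word, so the pair list
    // is scanned only when the text yields a term not seen before.
    interesting(w) && seen.add(w) && terms.exists { case (a, b) => seen(a) && seen(b) }
  }
}
```

Since the state lives inside the function, no reset is needed between texts.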

If you mean that the words have to occur next to each other, in order, you should then change check to:

def check(s: String): Boolean = {
  if (found_one && s == two) found_two = true
  else {
    // Any other word breaks adjacency; start over, noting whether this word is `one`.
    found_one = (s == one)
    found_two = false
  }
  found_one && found_two
}
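
The adjacency test can also be sketched without mutable state by sliding a two-word window over the word iterator. The name anyAdjacentPair is hypothetical, and the sketch assumes the words arrive as an Iterator[String].

```scala
// Hypothetical helper: true if some pair occurs as two adjacent words, in order.
def anyAdjacentPair(words: Iterator[String], terms: List[(String, String)]): Boolean = {
  val pairs: Set[(String, String)] = terms.toSet
  // sliding(2) lazily yields consecutive two-word windows; exists stops at the first hit.
  words.sliding(2).exists {
    case Seq(a, b) => pairs((a, b))
    case _         => false // the text has fewer than two words
  }
}
```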