What is the most elegant way to find word pairs in a text with Scala?
Given a list of word pairs
val terms = ("word1a", "word1b") :: ("word2a", "word2b") :: ... :: Nil
What is the most elegant way in Scala to test whether at least one of the pairs occurs in a text? The test should terminate as quickly as possible once it hits the first match. How would you solve that?
EDIT: To be more precise, I want to know whether both words of a pair appear somewhere in the text (not necessarily in order). If that is the case for any pair in the list, the method should return true. It is not necessary to return the matched pair, nor does it matter whether more than one pair matches.
Comments (3)
EDIT: Note that using a Set to represent the tokens in the text makes the contains lookup much more efficient: a hash-based Set answers a membership test in effectively constant time, whereas something sequential like a List has to be scanned element by element.
EDIT 2: Updated to reflect the clarified requirement.
EDIT 3: Changed contains to apply, per the suggestion in the comment.
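The code that originally accompanied this answer is not shown here, so the following is only a minimal sketch of the idea the edit notes describe: tokenize the text once into a Set, then use exists over the pairs so the test stops at the first match. The sample terms, the regex-based tokenizer and the names PairLookup, tokens and containsAnyPair are assumptions made for illustration, not the answerer's code.

```scala
object PairLookup {
  // Hypothetical sample data; the real `terms` list is elided in the question.
  val terms: List[(String, String)] =
    ("quick", "fox") :: ("lazy", "cat") :: Nil

  // Assumed tokenizer: split on runs of non-word characters and build the Set once,
  // so every later membership test is effectively constant time.
  def tokens(text: String): Set[String] =
    text.split("\\W+").filter(_.nonEmpty).toSet

  // `exists` stops at the first pair whose two words are both present.
  // A Set[String] is also a function String => Boolean, so `words(a)` calls `apply`.
  def containsAnyPair(text: String, pairs: List[(String, String)]): Boolean = {
    val words = tokens(text)
    pairs.exists { case (a, b) => words(a) && words(b) }
  }
}
```

Here words(a) is just shorthand for words.contains(a), which appears to be what the third edit note refers to. Note that this version still tokenizes the whole text before looking at any pair; only the scan over the pairs terminates early.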
EDIT: It seems the ambiguous wording of your question means I answered a different question.
Because you are essentially asking for either word of a pair, you might as well flatten all the pairs into one big set.
Then you are just asking whether any of these words exists in the text.
This is fast because we can use the structure of a Set to look up quickly whether a word is contained in the text, and it terminates early thanks to exists.
If your text is very long, you may wish to Stream it, so that the next word to test is computed lazily rather than splitting the whole String into substrings up front. The same exists test can then run over the lazily produced words.
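Since the snippets for this answer are missing, here is a rough sketch of what the description suggests, answering the "either word of any pair" reading of the question. It uses an Iterator produced by Regex.findAllIn in place of Stream (which is deprecated in favour of LazyList since Scala 2.13); that substitution, the sample terms and the names AnyWordLookup, allWords, words and containsAnyWord are assumptions for illustration.

```scala
object AnyWordLookup {
  // Hypothetical sample data; the real `terms` list is elided in the question.
  val terms: List[(String, String)] =
    ("quick", "fox") :: ("lazy", "cat") :: Nil

  // Flatten every pair into one set of words of interest.
  val allWords: Set[String] =
    terms.flatMap { case (a, b) => List(a, b) }.toSet

  // Lazily yield the words of `text`: the regex matcher advances only when
  // the next element is requested, so nothing past the first hit is tokenized.
  def words(text: String): Iterator[String] =
    "\\w+".r.findAllIn(text)

  // `exists` stops pulling words as soon as one of them is in the set.
  def containsAnyWord(text: String): Boolean =
    words(text).exists(allWords)
}
```

In this sketch the Set holds the search terms rather than the text, so the text never has to be materialized as a collection at all.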
I'm assuming that both elements of a pair have to appear in the text, but that it doesn't matter where they appear or which pair matches.
I'm not sure this is the most elegant approach, but it's not bad, and it's fairly fast if you expect that the text probably contains the words (so you don't need to read all of it) and if you can generate an iterator that gives you the words one at a time.
You could improve things further by putting all the terms in a set and not even bothering to check the word-pair list unless the word from the text is in that set.
If you mean that the words have to occur next to each other and in order, you should change check accordingly.
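The original code for this answer, including its check function, is not preserved, so what follows is only a sketch under stated assumptions: words arrive one at a time from an Iterator, words of interest are remembered in a mutable set, and the pair list is consulted only when a new relevant word shows up. The names StreamingPairLookup, containsAnyPair, containsAdjacentPair and the local check are hypothetical stand-ins.

```scala
import scala.collection.mutable

object StreamingPairLookup {
  // Both words of some pair must occur somewhere in the input, in any order.
  def containsAnyPair(words: Iterator[String], pairs: List[(String, String)]): Boolean = {
    // Only words that occur in some pair are worth remembering.
    val interesting: Set[String] = pairs.flatMap { case (a, b) => List(a, b) }.toSet
    val seen = mutable.Set.empty[String]

    // Hypothetical stand-in for the answer's `check`:
    // true once both words of at least one pair have been seen.
    def check(): Boolean = pairs.exists { case (a, b) => seen(a) && seen(b) }

    // `exists` stops consuming input at the first word that completes a pair.
    // `seen.add` returns false for repeats, so `check` runs only on new relevant words.
    words.exists(w => interesting(w) && seen.add(w) && check())
  }

  // Assumed variant for the stricter reading: the two words must be adjacent and in order.
  def containsAdjacentPair(words: Iterator[String], pairs: List[(String, String)]): Boolean = {
    val pairSet = pairs.toSet
    words.sliding(2).exists {
      case Seq(a, b) => pairSet((a, b))
      case _         => false // fewer than two words in the input
    }
  }
}
```

Any source of words works as input, for example "\\w+".r.findAllIn(text), which produces matches lazily so a long text is read only as far as the first completed pair.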