How to find a whole word in a String in Java?

Posted on 2024-10-18 23:48:12

I have a String that I have to parse for different keywords.
For example, I have the String:

"I will come and meet you at the 123woods"

And my keywords are

'123woods'
'woods'

I should report whenever I have a match and where. Multiple occurrences should also be accounted for.

However, for this one, I should get a match only on '123woods', not on 'woods'. This eliminates using String.contains() method. Also, I should be able to have a list/set of keywords and check at the same time for their occurrence. In this example, if I have '123woods' and 'come', I should get two occurrences. Method execution should be somewhat fast on large texts.

My idea is to use StringTokenizer but I am unsure if it will perform well. Any suggestions?

Comments (14)

那些过往 2024-10-25 23:48:12

The example below is based on your comments. It uses a List of keywords, which will be searched in a given String using word boundaries. It uses StringUtils from Apache Commons Lang to build the regular expression and print the matched groups.

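import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.commons.lang3.StringUtils; // assumes Commons Lang 3; Lang 2 uses org.apache.commons.lang

String text = "I will come and meet you at the woods 123woods and all the woods";

List<String> tokens = new ArrayList<String>();
tokens.add("123woods");
tokens.add("woods");

// Builds \b(123woods|woods)\b: the keyword alternation wrapped in word boundaries.
String patternString = "\\b(" + StringUtils.join(tokens, "|") + ")\\b";
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(text);

// Prints each whole-word match in order of appearance.
while (matcher.find()) {
    System.out.println(matcher.group(1));
}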

If you are looking for more performance, you could have a look at StringSearch: high-performance pattern matching algorithms in Java.
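
Since the question also asks where each match occurs, here is a minimal sketch of the same word-boundary idea using only the JDK. The class name WholeWordRegex and the "keyword@index" output format are my own assumptions, not part of the answer above. Pattern.quote is added so that keywords containing regex metacharacters are matched literally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WholeWordRegex {

    // Returns one "keyword@index" entry per whole-word occurrence, in text order.
    // (Hypothetical helper, written for illustration only.)
    public static List<String> findAll(String text, List<String> keywords) {
        List<String> quoted = new ArrayList<>();
        for (String keyword : keywords) {
            quoted.add(Pattern.quote(keyword)); // match the keyword as literal text
        }
        Pattern pattern = Pattern.compile("\\b(" + String.join("|", quoted) + ")\\b");

        List<String> hits = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            // start(1) is the index of the first character of the matched keyword
            hits.add(matcher.group(1) + "@" + matcher.start(1));
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "I will come and meet you at the woods 123woods and all the woods";
        // prints [woods@32, 123woods@38, woods@59]
        System.out.println(findAll(text, Arrays.asList("123woods", "woods")));
    }
}
```

Keep in mind that \b only behaves as a whole-word test when the keyword itself starts and ends with a word character (letter, digit or underscore).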

云胡 2024-10-25 23:48:12

Use regex + word boundaries as others answered.

"I will come and meet you at the 123woods".matches(".*\\b123woods\\b.*");

will be true.

"I will come and meet you at the 123woods".matches(".*\\bwoods\\b.*");

will be false.

半仙 2024-10-25 23:48:12

Hope this works for you:

String string = "I will come and meet you at the 123woods";
String keyword = "123woods";

// Splits on single spaces, so punctuation stays attached to the words.
boolean found = Arrays.asList(string.split(" ")).contains(keyword);
if (found) {
    System.out.println("Keyword matched the string");
}

http://codigounico.blogspot.com/

三生路 2024-10-25 23:48:12

How about something like Arrays.asList(String.split(" ")).contains("xx")?

See String.split() and How can I test if an array contains a certain value.

浮世清欢 2024-10-25 23:48:12

Here is a way to match an exact word in a String on Android:

String full = "Hello World. How are you ?";

String one = "Hell";
String two = "Hello";
String three = "are";
String four = "ar";


boolean is1 = isContainExactWord(full, one);
boolean is2 = isContainExactWord(full, two);
boolean is3 = isContainExactWord(full, three);
boolean is4 = isContainExactWord(full, four);

Log.i("Contains Result", is1+"-"+is2+"-"+is3+"-"+is4);

Result: false-true-true-false

The function that matches the word:

private boolean isContainExactWord(String fullString, String partWord){
    // Pattern.quote treats partWord as literal text; \b anchors it to word boundaries.
    String pattern = "\\b"+Pattern.quote(partWord)+"\\b";
    Pattern p=Pattern.compile(pattern);
    Matcher m=p.matcher(fullString);
    return m.find();
}

Done

自此以后,行同陌路 2024-10-25 23:48:12

Try to match using regular expressions. Match for "\b123woods\b", where \b is a word boundary.

遮云壑 2024-10-25 23:48:12

A solution was accepted long ago, but it can be improved, so in case someone has a similar problem:

This is a classic application for multi-pattern search algorithms.

Java's pattern search (with Matcher.find) is not well suited to this. Searching for exactly one keyword is optimized in Java, but searching for an or-expression uses the regex nondeterministic automaton, which backtracks on mismatches. In the worst case, each character of the text is processed l times (where l is the sum of the pattern lengths).

Single-pattern search is better, but still not well suited: the whole search has to be restarted for every keyword pattern. In the worst case, each character of the text is processed p times, where p is the number of patterns.

Multi-pattern search processes each character of the text exactly once. Suitable algorithms include Aho-Corasick, Wu-Manber, and Set Backwards Oracle Matching. These can be found in libraries such as Stringsearchalgorithms or byteseek.

// example with StringSearchAlgorithms

AhoCorasick stringSearch = new AhoCorasick(asList("123woods", "woods"));

CharProvider text = new StringCharProvider("I will come and meet you at the woods 123woods and all the woods", 0);

StringFinder finder = stringSearch.createFinder(text);

List<StringMatch> all = finder.findAll();
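
For readers who want to see the idea without a library, below is a compact Aho-Corasick sketch in plain Java. It is not the StringSearchAlgorithms API; the class WholeWordAhoCorasick, its methods and the "keyword@index" output format are assumptions made for this illustration. Because the question wants whole words only, each raw match is additionally checked against its neighbouring characters.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

public class WholeWordAhoCorasick {
    private final List<Map<Character, Integer>> next = new ArrayList<>(); // trie edges per node
    private final List<Integer> fail = new ArrayList<>();                 // failure link per node
    private final List<List<String>> out = new ArrayList<>();             // keywords ending at a node

    public WholeWordAhoCorasick(List<String> keywords) {
        newNode(); // node 0 is the root
        for (String keyword : keywords) {
            int node = 0;
            for (char c : keyword.toCharArray()) {
                Integer child = next.get(node).get(c);
                if (child == null) {
                    child = newNode();
                    next.get(node).put(c, child);
                }
                node = child;
            }
            out.get(node).add(keyword);
        }
        // Breadth-first pass: depth-1 nodes keep the root as failure link,
        // deeper nodes get the longest proper suffix that is also a trie path.
        Queue<Integer> queue = new ArrayDeque<>(next.get(0).values());
        while (!queue.isEmpty()) {
            int node = queue.remove();
            for (Map.Entry<Character, Integer> edge : next.get(node).entrySet()) {
                int child = edge.getValue();
                int f = step(fail.get(node), edge.getKey());
                fail.set(child, f);
                out.get(child).addAll(out.get(f)); // inherit keywords ending at that suffix
                queue.add(child);
            }
        }
    }

    private int newNode() {
        next.add(new HashMap<>());
        fail.add(0);
        out.add(new ArrayList<>());
        return next.size() - 1;
    }

    // Follows failure links until an edge for c exists or the root is reached.
    private int step(int state, char c) {
        while (state != 0 && !next.get(state).containsKey(c)) {
            state = fail.get(state);
        }
        return next.get(state).getOrDefault(c, 0);
    }

    // Letters, digits and '_' count as word characters; out-of-range indexes do not.
    private static boolean isWordChar(String text, int index) {
        return index >= 0 && index < text.length()
                && (Character.isLetterOrDigit(text.charAt(index)) || text.charAt(index) == '_');
    }

    // Single left-to-right scan; returns "keyword@index" for every whole-word occurrence.
    public List<String> findWholeWords(String text) {
        List<String> hits = new ArrayList<>();
        int state = 0;
        for (int i = 0; i < text.length(); i++) {
            state = step(state, text.charAt(i));
            for (String keyword : out.get(state)) {
                int start = i - keyword.length() + 1;
                if (!isWordChar(text, start - 1) && !isWordChar(text, i + 1)) {
                    hits.add(keyword + "@" + start);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        WholeWordAhoCorasick search = new WholeWordAhoCorasick(Arrays.asList("123woods", "woods"));
        // prints [woods@32, 123woods@38, woods@59]
        System.out.println(search.findWholeWords(
                "I will come and meet you at the woods 123woods and all the woods"));
    }
}
```

The "woods" that ends inside "123woods" is found by the automaton but rejected by the neighbour check, because the character before it is a digit.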

楠木可依 2024-10-25 23:48:12

public class FindTextInLine {
    String match = "123woods";
    String text = "I will come and meet you at the 123woods";

    public void findText () {
        if (text.contains(match)) {
            System.out.println("Keyword matched the string" );
        }
    }
}

樱花落人离去 2024-10-25 23:48:12

You can use regular expressions.
Use the Pattern and Matcher classes to get the desired output.

゛时过境迁 2024-10-25 23:48:12

You can also use regex matching with the \b anchor (word boundary), which restricts a match to whole words.

乖乖公主 2024-10-25 23:48:12

A much simpler way to do this is to use split():

String match = "123woods";
String text = "I will come and meet you at the 123woods";

String[] sentence = text.split(" ");
for(String word: sentence)
{
    if(word.equals(match))
        return true;
}
return false;

This is a simpler, less elegant way to do the same thing without using tokens, etc.

夏末 2024-10-25 23:48:12

To match "123woods" instead of "woods", use atomic grouping in the regular expression.
Note that a single match attempt stops at the first "123woods" it finds rather than searching the rest of the string.

\b(?>123woods|woods)\b

It tries 123woods first; once that alternative has matched, the group does not go back to try the others.
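
To make the behaviour of the atomic group concrete, here is a small sketch. The class AtomicGroupDemo and its helper findAll are hypothetical names of mine. It shows that repeated calls to Matcher.find() still walk through the whole text, and that the order of alternatives matters inside (?>...), because the group never backtracks once one alternative has matched.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AtomicGroupDemo {

    // Collects every match of the regex in the text, in order of appearance.
    public static List<String> findAll(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "I will come and meet you at the woods 123woods and all the woods";

        // prints [woods, 123woods, woods]: find() keeps searching after each match
        System.out.println(findAll("\\b(?>123woods|woods)\\b", text));

        // prints []: the atomic group commits to "wood", the trailing \b fails,
        // and the group is not allowed to retry with "woods"
        System.out.println(findAll("\\b(?>wood|woods)\\b", "woods"));

        // prints [woods]: an ordinary non-capturing group can backtrack to "woods"
        System.out.println(findAll("\\b(?:wood|woods)\\b", "woods"));
    }
}
```

So with an atomic group it is safer to list longer alternatives before their prefixes.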

作业与我同在 2024-10-25 23:48:12

Looking back at the original question, we need to find some given keywords in a given sentence, count the number of occurrences and know something about where they occur. I don't quite understand what "where" means (is it an index in the sentence?), so I'll pass on that one... I'm still learning Java, one step at a time, so I'll see to it in due time :-)

Note that common sentences (such as the one in the original question) can contain repeated keywords, so the search cannot just ask whether a given keyword "exists or not" and count it as 1 if it does. There can be more than one of the same. For example:

// Base sentence (added punctuation, to make it more interesting):
String sentence = "Say that 123 of us will come by and meet you, "
                + "say, at the woods of 123woods.";

// Split it (punctuation taken in consideration, as well):
java.util.List<String> strings = 
                       java.util.Arrays.asList(sentence.split(" |,|\\."));

// My keywords:
java.util.ArrayList<String> keywords = new java.util.ArrayList<>();
keywords.add("123woods");
keywords.add("come");
keywords.add("you");
keywords.add("say");

By looking at it, the expected result would be 5 for "Say" + "come" + "you" + "say" + "123woods", counting "say" twice if we go lowercase. If we don't, then the count should be 4, "Say" being excluded and "say" included. Fine. My suggestion is:

// Set... ready...?
int counter = 0;

// Go!
for(String s : strings)
{
    // Asking whether each word of the sentence is one of the keywords, not
    // the other way around, so repeated keywords in the sentence are counted.
    boolean found = keywords.contains(s.toLowerCase());
    if(found)
    {
        counter ++;
        System.out.println("Found: " + s);
    }
}

// Statistics:
if (counter > 0)
{
    System.out.println("In sentence: " + sentence + "\n"
                     + "Count: " + counter);
}

And the results are:

Found: Say
Found: come
Found: you
Found: say
Found: 123woods
In sentence: Say that 123 of us will come by and meet you, say, at the woods of 123woods.
Count: 5
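
If "where" is taken to mean the character index in the sentence (an assumption on my part, since the question does not say), the same counting idea can report positions by scanning words with a Matcher instead of split(). The class KeywordPositions and the method locate are hypothetical names; keywords are expected in lower case, as in the example above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeywordPositions {
    // A "word" here is a run of letters, digits or underscores.
    private static final Pattern WORD = Pattern.compile("\\w+");

    // Returns "word@index" for each word of the sentence that is a keyword (case-insensitive).
    public static List<String> locate(String sentence, Set<String> lowerCaseKeywords) {
        List<String> hits = new ArrayList<>();
        Matcher words = WORD.matcher(sentence);
        while (words.find()) {
            String word = words.group();
            // One hash lookup per word, regardless of how many keywords there are.
            if (lowerCaseKeywords.contains(word.toLowerCase(Locale.ROOT))) {
                hits.add(word + "@" + words.start());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String sentence = "Say that 123 of us will come by and meet you, "
                        + "say, at the woods of 123woods.";
        Set<String> keywords = new HashSet<>(Arrays.asList("123woods", "come", "you", "say"));
        List<String> hits = locate(sentence, keywords);
        // prints [Say@0, come@24, you@41, say@46, 123woods@67]
        System.out.println(hits);
        System.out.println("Count: " + hits.size()); // prints Count: 5
    }
}
```

Using a HashSet for the keywords keeps each lookup cheap, which should help with the "somewhat fast on large texts" requirement.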

风透绣罗衣 2024-10-25 23:48:12

If you want to identify a whole word in a string and change the content of that word, you can do it this way. The final string stays the same except for the word you treated (runs of whitespace are collapsed to a single space). In this case "not" becomes "'not'" in the final string.

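    // value is the input String; ArrayUtils comes from Apache Commons Lang.
    StringBuilder sb = new StringBuilder();
    String[] splited = value.split("\\s+");
    if(ArrayUtils.isNotEmpty(splited)) {
        for(String valor : splited) {
            // Add the separator only between words, not before the first one.
            if(sb.length() > 0) {
                sb.append(" ");
            }
            if("not".equalsIgnoreCase(valor)) {
                sb.append("'").append(valor).append("'");
            } else {
                sb.append(valor);
            }
        }
    }
    return sb.toString();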
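
A possible alternative, sketched with a regex replacement so that the original whitespace and punctuation are left untouched. The class QuoteWholeWord and the method quoteNot are hypothetical names of mine, and this swaps the split-and-rebuild technique above for Matcher.replaceAll.

```java
import java.util.regex.Pattern;

public class QuoteWholeWord {
    // (?i) makes the match case-insensitive; \b restricts it to the whole word "not".
    private static final Pattern NOT = Pattern.compile("(?i)\\bnot\\b");

    // Wraps every whole-word occurrence of "not" in single quotes.
    public static String quoteNot(String value) {
        // $0 in the replacement stands for the text that was matched.
        return NOT.matcher(value).replaceAll("'$0'");
    }

    public static void main(String[] args) {
        // prints This is 'NOT' a knot, 'not'  at all
        System.out.println(quoteNot("This is NOT a knot, not  at all"));
    }
}
```

Unlike the split-based version, this one also handles a "not" that is followed directly by punctuation, since \b does not depend on spaces.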