How do I extend WhitespaceTokenizer?

Posted on 2024-12-06 12:49:18

I need a tokenizer that splits words on whitespace, but does not split when the whitespace is within double parentheses. Here is an example:

My input-> term1 term2 term3 ((term4 term5)) term6  

should produce this list of tokens:

term1, term2, term3, ((term4 term5)), term6.  

I think I can obtain this behaviour by extending Lucene's WhitespaceTokenizer. How can I implement this extension?
Are there other solutions?

Thanks in advance.

Comments (2)

娇女薄笑 2024-12-13 12:49:18

I haven't tried to extend the Tokenizer, but here is a (I think) nice solution using a regular expression:

\w+|\(\([\w\s]*\)\)

Together with a method that collects every match of the regex in the string and returns them as an array. Code example:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Regex_ComandLine {

    public static void main(String[] args) {
        String input = "term1 term2 term3 ((term4 term5)) term6";    // your input
        String[] parsedInput = splitByMatchedGroups(input, "\\w+|\\(\\([\\w\\s]*\\)\\)");

        for (String arg : parsedInput) {
            System.out.println(arg);
        }
    }

    // Collects every substring that matches the pattern, in order of appearance.
    static String[] splitByMatchedGroups(String string, String patternString) {
        List<String> matchList = new ArrayList<>();
        Matcher regexMatcher = Pattern.compile(patternString).matcher(string);

        while (regexMatcher.find()) {
            matchList.add(regexMatcher.group());
        }

        return matchList.toArray(new String[0]);
    }

}

The output:

term1
term2
term3
((term4 term5))
term6
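
One caveat with the pattern above: `\w+` only matches word characters, so a term such as `term-1` would be cut into pieces, and a group containing punctuation would not be kept together. A minimal sketch of a variant that stays closer to "split on whitespace" follows. The class name, method name and pattern are my own assumptions, not part of the original answer, and the pattern assumes each `((...))` group is itself delimited by whitespace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WhitespaceOrGroupTokens {
    // Hypothetical pattern: first try a "((...))" group (reluctant, so it stops
    // at the first "))"), otherwise take any run of non-whitespace characters.
    static final Pattern TOKEN = Pattern.compile("\\(\\(.*?\\)\\)|\\S+");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints: [term-1, term2, ((term4, term5)), term6]
        System.out.println(tokenize("term-1 term2 ((term4, term5)) term6"));
    }
}
```

Note that `.` does not match line terminators by default, so a group spanning several lines would need `Pattern.DOTALL`.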

Hope this helps.

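Please note that passing the same pattern to the usual split():

String[] parsedInput = input.split("\\w+|\\(\\([\\w\\s]*\\)\\)");

will not give you what you want, because split() treats the pattern as the delimiter: it returns the text between the matches (here, only whitespace) rather than the matched tokens.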
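
To see the difference concretely, here is a small sketch that runs split() with the token pattern and prints each piece in quotes. The class and method names are hypothetical and only serve the illustration.

```java
class SplitVersusFind {
    static final String TOKEN_REGEX = "\\w+|\\(\\([\\w\\s]*\\)\\)";

    // split() uses the regex as the delimiter, so the tokens themselves are discarded.
    static String[] splitOnTokens(String input) {
        return input.split(TOKEN_REGEX);
    }

    public static void main(String[] args) {
        String[] parts = splitOnTokens("term1 term2 term3 ((term4 term5)) term6");
        // Quote each piece so that empty and whitespace-only strings are visible.
        for (String part : parts) {
            System.out.println("\"" + part + "\"");
        }
    }
}
```

None of the printed pieces contains a term: what comes back is the leading empty string and the single spaces that separated the matches, which is why the matcher loop with find() is used instead.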

君勿笑 2024-12-13 12:49:18

You can do this by extending WhitespaceTokenizer, but I expect it will be easier if you write a TokenFilter that reads from a WhitespaceTokenizer and pastes together consecutive tokens based on the number of parentheses.

Overriding incrementToken is the main task when writing a Tokenizer-like class. I once did this myself; the result might serve as an example (though for technical reasons, I couldn't make my class a TokenFilter).
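
The "paste together based on the number of parentheses" idea can be sketched in plain Java, without any Lucene classes. This is only an illustration of the merging logic that such a filter might apply inside incrementToken, not the answerer's code and not Lucene API; the class and method names are hypothetical. It assumes the input has already been split on whitespace, and it rejoins merged tokens with a single space, so the original spacing inside a group is not preserved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ParenthesisMerger {
    // Net count of '(' minus ')' in one token.
    static int parenBalance(String token) {
        int balance = 0;
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (c == '(') balance++;
            else if (c == ')') balance--;
        }
        return balance;
    }

    // Glues consecutive whitespace-separated tokens together while parentheses are open.
    static List<String> merge(List<String> whitespaceTokens) {
        List<String> merged = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        int depth = 0;
        for (String token : whitespaceTokens) {
            if (pending.length() > 0) pending.append(' ');
            pending.append(token);
            depth += parenBalance(token);
            if (depth <= 0) {            // everything opened so far has been closed
                merged.add(pending.toString());
                pending.setLength(0);
                depth = 0;
            }
        }
        // Unclosed parentheses at end of input: emit what was collected.
        if (pending.length() > 0) merged.add(pending.toString());
        return merged;
    }

    public static void main(String[] args) {
        String input = "term1 term2 term3 ((term4 term5)) term6";
        // Prints: [term1, term2, term3, ((term4 term5)), term6]
        System.out.println(merge(Arrays.asList(input.trim().split("\\s+"))));
    }
}
```

In an actual filter the same bookkeeping would run over tokens pulled from the upstream stream one at a time, and token offsets would also need adjusting, which this sketch leaves out.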
