How to extend WhitespaceTokenizer?

Posted on 12-06 12:49


I need a tokenizer that splits words on whitespace, but does not split when the whitespace is within double parentheses. Here is an example:

My input-> term1 term2 term3 ((term4 term5)) term6

should produce this list of tokens:

term1, term2, term3, ((term4 term5)), term6.

I think I can obtain this behaviour by extending Lucene's WhitespaceTokenizer. How can I write this extension?
Are there any other solutions?

Thanks in advance.


娇女薄笑 2024-12-13 12:49:18

I haven't tried extending the Tokenizer, but here is what I think is a nice solution using a regular expression:

\w+|\(\([\w\s]*\)\)

And a method that collects every match of the regex in the string and returns the matches as an array. Code example:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Regex_ComandLine {

    public static void main(String[] args) {
        String input = "term1 term2 term3 ((term4 term5)) term6";    // your input
        String[] parsedInput = splitByMatchedGroups(input, "\\w+|\\(\\([\\w\\s]*\\)\\)");

        for (String arg : parsedInput) {
            System.out.println(arg);
        }
    }

    // Returns every substring of `string` matched by `patternString`, in order of appearance.
    static String[] splitByMatchedGroups(String string, String patternString) {
        List<String> matchList = new ArrayList<>();
        Matcher regexMatcher = Pattern.compile(patternString).matcher(string);

        // find() advances to the next match; group() is the text of that match.
        while (regexMatcher.find()) {
            matchList.add(regexMatcher.group());
        }

        return matchList.toArray(new String[0]);
    }

}

The output:

term1
term2
term3
((term4 term5))
term6

Hope this helps.

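Please note that the following code, which uses the usual split():

String[] parsedInput = input.split("\\w+|\\(\\([\\w\\s]*\\)\\)");

will not give you what you want, because split() treats the pattern as the delimiter: the matched terms are thrown away and only what lies between them (here, the spaces) is returned.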
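One caveat with the pattern above: \w+ only matches letters, digits and underscores, so a term such as foo-bar would come back as foo and bar, which is not how a whitespace tokenizer behaves. Below is a minimal sketch of a variant that keeps any run of non-whitespace characters as one token. The class name ParenAwareSplit, the method tokenize and the pattern itself are my own choices for illustration, not part of the answer above, and the sketch assumes that groups are not nested and that "((" starts at a token boundary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParenAwareSplit {

    // Alternative 1: a "((...))" group, matched reluctantly up to the first "))".
    // Alternative 2: any run of non-whitespace characters.
    // The group alternative comes first so that "((" is not consumed by \S+.
    private static final Pattern TOKEN = Pattern.compile("\\(\\(.*?\\)\\)|\\S+");

    // Hypothetical helper: collects every match of TOKEN in order of appearance.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("term1 term2 term3 ((term4 term5)) term6")) {
            System.out.println(token);
        }
    }
}
```

With the input from the question this should print the same five tokens as before; the difference only shows up for terms that contain punctuation.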


君勿笑 2024-12-13 12:49:18

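You can do this by extending WhitespaceTokenizer, but I expect it will be easier to write a TokenFilter that reads from a WhitespaceTokenizer and pastes consecutive tokens together based on the number of parentheses.

Overriding incrementToken is the main task when writing a Tokenizer-like class. I did this myself once; the result can serve as an example (though for technical reasons I could not make my class a TokenFilter).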
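The pasting idea described above can be tried outside Lucene first. The following is a plain-Java sketch, not a real TokenFilter: the names ParenGlue, depthDelta and glue are hypothetical, and it assumes that any parenthesis (not only a double one) opens a group and that whitespace inside a group may be collapsed to a single space. In an actual TokenFilter the same bookkeeping would presumably live in incrementToken, pulling tokens from the wrapped stream until the depth returns to zero.

```java
import java.util.ArrayList;
import java.util.List;

class ParenGlue {

    // Net change in nesting depth contributed by one token:
    // +1 for every '(' and -1 for every ')'.
    static int depthDelta(String token) {
        int delta = 0;
        for (char c : token.toCharArray()) {
            if (c == '(') {
                delta++;
            } else if (c == ')') {
                delta--;
            }
        }
        return delta;
    }

    // Splits on whitespace, then pastes consecutive tokens back together
    // for as long as there are unclosed parentheses.
    static List<String> glue(String input) {
        List<String> out = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        int depth = 0;
        for (String token : input.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // happens only for blank input
            }
            if (pending.length() > 0) {
                pending.append(' ');
            }
            pending.append(token);
            depth += depthDelta(token);
            if (depth <= 0) {
                // Group is closed (or there never was one): emit the token.
                out.add(pending.toString());
                pending.setLength(0);
                depth = 0;
            }
        }
        if (pending.length() > 0) {
            out.add(pending.toString()); // unbalanced input: emit what is left
        }
        return out;
    }

    public static void main(String[] args) {
        for (String token : glue("term1 term2 term3 ((term4 term5)) term6")) {
            System.out.println(token);
        }
    }
}
```

Counting parentheses rather than matching a fixed pattern means nested groups are handled too, at the cost of also gluing single-parenthesis groups; if only "((" should open a group, the counting rule would need to be tightened.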
