How do I keep the delimiters when splitting with a regex?

Posted on 2024-11-30 15:21:23 · 991 characters · 1 view · 0 comments

I asked a question about punctuation and regex before, but it was confusing.

Suppose I have this text:

String text = "wor.d1, :word2. wo,rd3? word4!"; 

I'm doing this:

String parts[] = text.split(" ");

And I get this:

wor.d1, | :word2. | wo,rd3? | word4!

What do I need to do to get this instead? (Keep the symbols at the word borders as separate tokens, but only the ones I specify, .,!?: and not all of them.)

wor.d1 | , | : | word2 | . | wo,rd3 | ? | word4 | !

UPDATE

I'm getting some good results with the regex below, but it gives an empty string before every split on punctuation at the start of a word.

Is there a way to avoid this empty string at the start?

Is this regex good, or is there a simpler way?

public static final String PUNCTUATION_SEPARATOR =
        "("
        + "("
        + "(?=^[\"'!?.,;:(){}\\[\\]]+)"
        + "|"
        + "(?<=^[\"'!?.,;:(){}\\[\\]]+)"
        + ")"
        + "|"
        + "("
        + "(?=[\"'!?.,;:(){}\\[\\]]+($|\n))"
        + "|"
        + "(?<=[\"'!?.,;:(){}\\[\\]]+($|\n))"
        + ")"
        + ")";

Comments (5)

黯然 2024-12-07 15:21:23

Are you sure you want to use regex?
There's a faster implementation for splitting by single characters: StringTokenizer.
It can also return the delimiters.

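import java.util.StringTokenizer;

String str = "word1, word2. word3? word4!";
String delim = ",.!?";
// The third argument (true) makes the tokenizer return the delimiters as tokens too.
StringTokenizer st = new StringTokenizer(str, delim, true);
while (st.hasMoreTokens()) {
  String token = st.nextToken();
  System.out.println(token); // token will be: "word1", ",", " word2", ".", etc...
}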
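
As a rough illustration of how this behaves on the text from the question, here is a small sketch (the class name TokenizerDemo and the method tokenize are made up for this example). Note that StringTokenizer splits on every delimiter character, including ones inside a word, so on its own it does not give the border-only behaviour the question asks for:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizerDemo {
    // Hypothetical helper: collects all tokens, delimiters included, into a list.
    static List<String> tokenize(String text, String delims) {
        List<String> tokens = new ArrayList<>();
        // returnDelims = true: each delimiter character comes back as its own token.
        StringTokenizer st = new StringTokenizer(text, delims, true);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "wor.d1, :word2. wo,rd3? word4!";
        // Print each token in quotes so that whitespace inside tokens stays visible.
        for (String token : tokenize(text, ".,!?:")) {
            System.out.println("\"" + token + "\"");
        }
    }
}
```

Here "wor.d1" comes back as "wor", ".", "d1", and tokens such as " wo" keep their leading space, because the space is not in the delimiter set.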
梦巷 2024-12-07 15:21:23

For simple separators I recommend the StringTokenizer. But here's a solution using regex and another auxiliary separator:

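import java.util.regex.Pattern;

String s = "one,two, three   four ,  five";
// Wrap every run of commas/whitespace in the auxiliary marker '#'.
s = s.replaceAll("([,\\s]+)", "#$1#");
// Splitting on the marker keeps the separator runs as array elements.
Pattern p = Pattern.compile("#");
String[] result = p.split(s);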
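
A runnable version of the same idea might look like the sketch below. The class name AuxSeparatorDemo and the method splitKeepingSeparators are invented for this example, and the guard against '#' in the input is an added assumption: the trick only works if the auxiliary character never occurs in the text.

```java
import java.util.regex.Pattern;

public class AuxSeparatorDemo {
    private static final Pattern MARKER = Pattern.compile("#");

    // Hypothetical helper: splits on runs of commas/whitespace and keeps those runs.
    static String[] splitKeepingSeparators(String s) {
        if (s.indexOf('#') >= 0) {
            // The marker must be absent from the input, otherwise it would split there too.
            throw new IllegalArgumentException("input must not contain '#'");
        }
        String marked = s.replaceAll("([,\\s]+)", "#$1#");
        return MARKER.split(marked);
    }

    public static void main(String[] args) {
        for (String part : splitKeepingSeparators("one,two, three   four ,  five")) {
            System.out.println("[" + part + "]");
        }
    }
}
```

Each separator run, such as ", " or " ,  ", is returned as a single element between the words.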
背叛残局 2024-12-07 15:21:23

Here's a regex that I think will work:

/\s|(?=[\.,:?!](\W|$))|(?<=\W[\.:?!])/
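
To try this regex from Java, it can be passed to String.split with the backslashes doubled for the string literal. The sketch below uses the text from the question; the class name BorderSplitDemo and the method splitAtBorders are made up for illustration.

```java
public class BorderSplitDemo {
    // The regex from this answer, escaped for a Java string literal:
    // whitespace, OR a position before punctuation that ends a word,
    // OR a position after punctuation that follows a non-word character.
    static final String BORDER_SPLIT = "\\s|(?=[\\.,:?!](\\W|$))|(?<=\\W[\\.:?!])";

    static String[] splitAtBorders(String text) {
        return text.split(BORDER_SPLIT);
    }

    public static void main(String[] args) {
        String text = "wor.d1, :word2. wo,rd3? word4!";
        System.out.println(String.join(" | ", splitAtBorders(text)));
        // prints: wor.d1 | , | : | word2 | . | wo,rd3 | ? | word4 | !
    }
}
```

One limitation to be aware of: the lookbehind requires a non-word character before the punctuation, so a mark at the very start of the text (for example ":word") is not split off, and a comma is not in the lookbehind's character class.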
热情消退 2024-12-07 15:21:23

In my opinion you want this. First explode your string, then as a second step use the implode function.

上课铃就是安魂曲 2024-12-07 15:21:23
public static final String PUNCTUATION_SEPARATOR =
    "("
    + "("
    + "(?=^[\"'!?.,;:(){}\\[\\]-]+)"
    + "|"
    + "(?<=^[\"'!?.,;:(){}\\[\\]-]+)"
    + ")"
    + "|"
    + "("
    + "(?=[\"'!?.,;:(){}\\[\\]-]+($|\n))"
    + "|"
    + "(?<=[\"'!?.,;:(){}\\[\\]-]+($|\n))"
    + ")"
    + ")";