Regex question - one or more spaces outside of a quote-enclosed block of text

Posted 2024-07-08 13:25:05

I want to replace any occurrence of more than one space with a single space, but take no action in text between quotes.

Is there any way of doing this with a Java regex? If so, can you please attempt it or give me a hint?


Comments (7)

几度春秋 2024-07-15 13:25:05

Here's another approach, that uses a lookahead to determine that all quotation marks after the current position come in matched pairs.

text = text.replaceAll("  ++(?=(?:[^\"]*+\"[^\"]*+\")*+[^\"]*+$)", " ");

If needed, the lookahead can be adapted to handle escaped quotation marks inside the quoted sections.
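
To make the lookahead idea easier to try out, here is a minimal sketch that wraps the same pattern in a helper. The class name `LookaheadDemo` and the method `collapse` are made up for illustration; the regex is the one from this answer, unchanged, so escaped quotes are still not handled.

```java
import java.util.regex.Pattern;

final class LookaheadDemo {
    // Two or more spaces, matched only when the rest of the input
    // contains an even number of quotation marks (i.e. we are outside quotes).
    private static final Pattern SPACES_OUTSIDE_QUOTES =
        Pattern.compile("  ++(?=(?:[^\"]*+\"[^\"]*+\")*+[^\"]*+$)");

    static String collapse(String text) {
        return SPACES_OUTSIDE_QUOTES.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        // Balanced quotes: only the runs outside the quotes are collapsed.
        System.out.println(collapse("a   b  \"c   d\"  e")); // a b "c   d" e

        // Unbalanced quote: the parity test flips, so the run after the
        // stray quote is collapsed and the run before it is left alone.
        System.out.println(collapse("a  b \"c  d"));         // a  b "c d
    }
}
```

The second call shows the main caveat of this approach: it relies on quotes being balanced, and a single stray quote changes which side counts as "inside".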

握住我的手 2024-07-15 13:25:05

When trying to match something that can be contained within something else, it can be helpful to construct a regular expression that matches both, like this:

("[^"\\]*(?:\\.[^"\\]*)*")|(  +)

This will match either a quoted string or a run of two or more spaces. Because the quoted-string alternative consumes the whole string, including any spaces inside it, the space alternative can only ever match outside quotes. Using this expression, you will need to examine each match to determine whether it is a quoted string or a run of spaces and act accordingly:

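import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Assumes a String variable named text is in scope.
// Group 1: a double-quoted string (backslash escapes allowed); group 2: two or more spaces.
Pattern spaceOrStringRegex = Pattern.compile( "(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\")|(  +)" );
StringBuffer replacementBuffer = new StringBuffer();
Matcher spaceOrStringMatcher = spaceOrStringRegex.matcher( text );

while ( spaceOrStringMatcher.find() )
{
    // if the space group is the match
    if ( spaceOrStringMatcher.group( 2 ) != null )
    {
        // replace with a single space
        spaceOrStringMatcher.appendReplacement( replacementBuffer, " " );
    }
    // quoted strings are skipped here; the next appendReplacement or appendTail copies them unchanged
}

// copy whatever follows the last match
spaceOrStringMatcher.appendTail( replacementBuffer );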
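
As a possible variation, on Java 9 or later the same "match both, then decide per match" logic can be written with `Matcher.replaceAll(Function<MatchResult, String>)` instead of the explicit loop. This is only a sketch; the class name `SpaceCollapser` and the method `collapse` are invented for illustration. Note that the returned replacement is interpreted like an `appendReplacement` string, so the quoted text has to go through `Matcher.quoteReplacement` to keep backslashes and `$` literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SpaceCollapser {
    // Group 1: quoted string with backslash escapes; group 2: two or more spaces.
    private static final Pattern QUOTED_OR_SPACES =
        Pattern.compile("(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\")|(  +)");

    static String collapse(String text) {
        return QUOTED_OR_SPACES.matcher(text).replaceAll(m ->
            m.group(2) != null
                ? " "                                    // run of spaces outside quotes
                : Matcher.quoteReplacement(m.group()));  // quoted string, kept verbatim
    }

    public static void main(String[] args) {
        System.out.println(collapse("blah    blah  \"boo   boo boo\"  blah  blah"));
        // blah blah "boo   boo boo" blah blah
    }
}
```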
橘寄 2024-07-15 13:25:05

Text between quotes: are the quotes always on the same line, or can a quoted section span multiple lines?

未央 2024-07-15 13:25:05

Tokenize it and emit a single space between tokens. A quick Google search for "java tokenizer that handles quotes" turned up:
this link

YMMV

Edit: SO didn't like that link. Here's the Google search link: google. It was the first result.
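
The link above did not survive, so the following is not that tokenizer. It is a hand-rolled sketch of a regex-free alternative in the same spirit: walk the characters once, track whether the current position is inside quotes, and drop repeated spaces only outside them. The names `QuoteAwareScanner` and `collapseSpaces` are hypothetical, and escaped quotes are deliberately not handled.

```java
final class QuoteAwareScanner {
    static String collapseSpaces(String text) {
        StringBuilder out = new StringBuilder(text.length());
        boolean inQuotes = false;          // true between an opening and closing quote
        boolean previousWasSpace = false;  // true if the last char emitted outside quotes was a space

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;      // every quote toggles the state
            }
            if (c == ' ' && !inQuotes) {
                if (!previousWasSpace) {
                    out.append(c);         // keep only the first space of a run
                }
                previousWasSpace = true;
            } else {
                out.append(c);             // everything else is copied as-is
                previousWasSpace = false;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(collapseSpaces("a   b  \"c   d\"  e")); // a b "c   d" e
    }
}
```

Unlike a strict "emit one space between tokens" tokenizer, this keeps single leading or trailing spaces rather than trimming them.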

猥琐帝 2024-07-15 13:25:05

Personally, I don't use Java, but this RegExp could do the trick:

([^\" ])*(\\\".*?\\\")*

Trying the expression in RegExBuddy generates the following code, which looks fine to me:

try {
    Pattern regex = Pattern.compile("([^\" ])*(\\\".*?\\\")*", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    Matcher regexMatcher = regex.matcher(subjectString);
    while (regexMatcher.find()) {
        for (int i = 1; i <= regexMatcher.groupCount(); i++) {
            // matched text: regexMatcher.group(i)
            // match start: regexMatcher.start(i)
            // match end: regexMatcher.end(i)

            // I suppose here you must use something like
            // sstr += regexMatcher.group(i) + " "
        }
    }
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
}

At least, it seems to work fine in Python:

import re

text = """
este  es   un texto de   prueba "para ver  como se comporta  " la funcion   sobre esto
"para ver  como se comporta  " la funcion   sobre esto  "o sobre otro" lo q sea
"""

ret = ""
print(text)

reobj = re.compile(r'([^\" ])*(\".*?\")*', re.IGNORECASE)

# join every non-empty match with "|" so the pieces are visible
for match in reobj.finditer(text):
    if match.group() != "":
        ret = ret + match.group() + "|"

print(ret)
迟月 2024-07-15 13:25:05

After you parse out the quoted content, run this on the rest, in bulk or piece by piece as necessary:

String text = "ABC   DEF GHI   JKL";
text = text.replaceAll("( )+", " ");
// text: "ABC DEF GHI JKL"
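
One way the "parse out the quoted content first" step could look, as a rough sketch only: split the input on quotation marks, so that even-numbered pieces lie outside quotes and odd-numbered pieces lie inside, then collapse spaces in the outside pieces and stitch everything back together. The class name `SplitOnQuotes` and the method `collapseOutsideQuotes` are made up here, and the sketch assumes quotes are balanced and never escaped.

```java
final class SplitOnQuotes {
    static String collapseOutsideQuotes(String text) {
        // limit -1 keeps trailing empty pieces, so a closing quote at the end is not lost
        String[] parts = text.split("\"", -1);
        StringBuilder out = new StringBuilder(text.length());

        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                out.append('"');           // put back the quote that split() removed
            }
            if (i % 2 == 0) {
                out.append(parts[i].replaceAll(" +", " ")); // outside quotes
            } else {
                out.append(parts[i]);                       // inside quotes, untouched
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(collapseOutsideQuotes("ABC   DEF \"GHI   JKL\"  MNO"));
        // ABC DEF "GHI   JKL" MNO
    }
}
```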
听闻余生 2024-07-15 13:25:05

Jeff, you're on the right track, but there are a few errors in your code, to wit: (1) You forgot to escape the quotation marks inside the negated character classes; (2) The parens inside the first capturing group should have been of the non-capturing variety; (3) If the second set of capturing parens doesn't participate in a match, group(2) returns null, and you're not testing for that; and (4) If you test for two or more spaces in the regex instead of one or more, you don't need to check the length of the match later on. Here's the revised code:

import java.util.regex.*;

public class Test
{
  public static void main(String[] args) throws Exception
  {
    String text = "blah    blah  \"boo   boo boo\"  blah  blah";
    Pattern p = Pattern.compile( "(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\")|(  +)" );
    StringBuffer sb = new StringBuffer();
    Matcher m = p.matcher( text );
    while ( m.find() ) 
    {
      if ( m.group( 2 ) != null ) 
      {
        m.appendReplacement( sb, " " );
      }
    }
    m.appendTail( sb );
    System.out.println( sb.toString() );
  }
}