Regex and escaped and unescaped delimiter

Posted on 2024-12-12 09:38:28

This question is related to this one.

I have a string

a\;b\\;c;d

which in Java looks like

String s = "a\\;b\\\\;c;d";

I need to split it on semicolons, with the following rules:

  1. If a semicolon is preceded by a backslash, it should not be treated as a separator (the one between a and b).

  2. If the backslash itself is escaped, and therefore does not escape the semicolon, that semicolon should be treated as a separator (the one between b and c).

So a semicolon should be treated as a separator if it is preceded by zero or an even number of backslashes.

For the example above, I want to get the following strings (double backslashes for the Java compiler):

a\;b\\
c
d

Comments (5)

阿楠 2024-12-19 09:38:28

You can use the regex

(?:\\.|[^;\\]++)*

to match all text between unescaped semicolons:

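List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("(?:\\\\.|[^;\\\\]++)*");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
    // Note: the pattern can also match the empty string, so empty
    // matches (e.g. just before each semicolon) are added as well.
    matchList.add(regexMatcher.group());
}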

Explanation:

(?:        # Match either...
 \\.       # any escaped character
|          # or...
 [^;\\]++  # any character(s) except semicolon or backslash; possessive match
)*         # Repeat any number of times.

The possessive match (++) is important to avoid catastrophic backtracking because of the nested quantifiers.
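
As a rough, self-contained sketch of how this regex might be used: because the pattern can match the empty string, `find()` also reports an empty match just before each separator and at the end of the input, so the version below skips empty matches (which also drops genuinely empty fields). The class name `RegexFieldMatcher` and the helper `fields` are made up for illustration and are not part of the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexFieldMatcher {
    // Either an escaped character, or a run of characters that are neither ';' nor '\'.
    private static final Pattern FIELD = Pattern.compile("(?:\\\\.|[^;\\\\]++)*");

    // Hypothetical helper: returns the non-empty fields between unescaped semicolons.
    static List<String> fields(String subject) {
        List<String> result = new ArrayList<>();
        Matcher m = FIELD.matcher(subject);
        while (m.find()) {
            // Skip the empty matches reported before each separator and at the end.
            if (!m.group().isEmpty()) {
                result.add(m.group());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(fields("a\\;b\\\\;c;d")); // prints [a\;b\\, c, d]
    }
}
```

If empty fields matter for your data, this filtering is too coarse and a different strategy for telling real empty fields from the spurious empty matches would be needed.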

幸福不弃 2024-12-19 09:38:28

I would not trust any kind of regular expression to detect those cases. I usually write a simple loop for such things; I'll sketch it in C, since it has been ages since I last touched Java ;-)

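/* Needs <stdio.h> and <string.h>; myString is a NUL-terminated char array. */
size_t i, len;
int state;   /* 1 while the previous character was an escaping backslash */
char c;

for (len = strlen(myString), state = 0, i = 0; i < len; i++) {
    c = myString[i];
    if (state == 0) {
        if (c == '\\') {
            state++;                          /* the next character is escaped */
        } else if (c == ';') {
            printf("; at offset %zu\n", i);   /* unescaped separator */
        }
    } else {
        state--;                              /* skip the escaped character */
    }
}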

The advantages are:

  1. You can execute semantic actions at each step.
  2. It is quite easy to port to another language.
  3. You don't need to include a complete regex library just for this simple task, which helps portability.
  4. It should be a lot faster than the regular expression matcher.
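
Since the question is about Java, the same loop might be ported along the following lines. This is only a sketch: the class name `LoopSplitter` and the method `split` are invented here, and it assumes that escape sequences should be copied through unchanged and that empty fields should be kept.

```java
import java.util.ArrayList;
import java.util.List;

class LoopSplitter {
    // Hypothetical helper: splits on ';' unless it is escaped by a preceding '\'.
    // Escape sequences are copied through unchanged; empty fields are kept.
    static List<String> split(String s) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean escaped = false; // true when the previous char was an unescaped '\'
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (escaped) {
                current.append(c);             // escaped character, never a separator
                escaped = false;
            } else if (c == '\\') {
                current.append(c);             // keep the backslash in the output
                escaped = true;
            } else if (c == ';') {
                parts.add(current.toString()); // unescaped separator ends the field
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        parts.add(current.toString());         // last field
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("a\\;b\\\\;c;d")); // prints [a\;b\\, c, d]
    }
}
```

The single boolean plays the role of `state` in the C sketch: it is set by an unescaped backslash and cleared by whatever character follows it.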

EDIT:
I have added a complete C++ example for clarification.

#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Splits s on unescaped ';' and removes the escaping backslashes.
// Empty fields are skipped.
std::vector<std::string> unescapeString(const char* s)
{
    std::vector<std::string> result;
    std::stringstream ss;
    bool has_chars = false;  // true once the current field has content
    int state = 0;           // 1 while the previous character was an escaping '\'

    for (;;) {
        auto c = *s++;

        if (state == 0) {
            if (!c) {
                // End of input: flush the last field, if any.
                if (has_chars) result.push_back(ss.str());
                break;
            } else if (c == '\\') {
                ++state;
            } else if (c == ';') {
                // Unescaped separator: finish the current field.
                if (has_chars) {
                    result.push_back(ss.str());
                    has_chars = false;
                    ss.str("");
                }
            } else {
                ss << c;
                has_chars = true;
            }
        } else /* if (state == 1) */ {
            if (!c) {
                // Trailing backslash with nothing to escape: keep it literally.
                ss << '\\';
                result.push_back(ss.str());
                break;
            }

            // Escaped character: copy it without the backslash.
            ss << c;
            has_chars = true;
            --state;
        }
    }

    return result;
}

int main(int argc, char* argv[])
{
    for (int i = 1; i < argc; ++i) {
        for (const auto& s: unescapeString(argv[i])) {
            std::cout << s << std::endl;
        }
    }
}


驱逐舰岛风号 2024-12-19 09:38:28

String[] splitArray = subjectString.split("(?<!(?<!\\\\)\\\\);");

This should work.

Explanation:

// (?<!(?<!\\)\\);
// 
// Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!(?<!\\)\\)»
//    Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!\\)»
//       Match the character “\” literally «\\»
//    Match the character “\” literally «\\»
// Match the character “;” literally «;»

So you match only the semicolons that are not preceded by exactly one \.

EDIT:

String[] splitArray = subjectString.split("(?<!(?<!\\\\(\\\\\\\\){0,2000000})\\\\);");

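This is intended to handle any odd number of \, and it cannot cover runs of more than about 4000000 \. Note, however, that {0,2000000} also allows zero repetitions, so the inner lookbehind already matches a single preceding \; as a result, a semicolon preceded by three backslashes is still treated as a separator. Explanation of the edited answer: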

// (?<!(?<!\\(\\\\){0,2000000})\\);
// 
// Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!(?<!\\(\\\\){0,2000000})\\)»
//    Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!\\(\\\\){0,2000000})»
//       Match the character “\” literally «\\»
//       Match the regular expression below and capture its match into backreference number 1 «(\\\\){0,2000000}»
//          Between zero and 2000000 times, as many times as possible, giving back as needed (greedy) «{0,2000000}»
//          Note: You repeated the capturing group itself.  The group will capture only the last iteration.  Put a capturing group around the repeated group to capture all iterations. «{0,2000000}»
//          Match the character “\” literally «\\»
//          Match the character “\” literally «\\»
//    Match the character “\” literally «\\»
// Match the character “;” literally «;»
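
A lookbehind that counts backslashes needs an anchor: without one, the engine can satisfy the count with a shorter suffix of the backslash run. One possible variant, offered as a sketch rather than as this answer's method, asserts that the semicolon is preceded by whole pairs of backslashes and that the run of pairs is not itself preceded by a backslash. The class name `LookbehindSplit` and the bound of 100 pairs are assumptions made here; Java needs some finite bound inside a lookbehind.

```java
import java.util.Arrays;

class LookbehindSplit {
    // Regex: (?<=(?<!\\)(?:\\\\){0,100});
    // Assumption: runs of backslashes are at most 200 characters long,
    // because Java requires a bounded lookbehind.
    private static final String SEPARATOR = "(?<=(?<!\\\\)(?:\\\\\\\\){0,100});";

    // Hypothetical helper: splits on semicolons preceded by an even number of backslashes.
    static String[] split(String s) {
        return s.split(SEPARATOR);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("a\\;b\\\\;c;d"))); // prints [a\;b\\, c, d]
        System.out.println(Arrays.toString(split("a\\\\\\;b")));     // prints [a\\\;b]
    }
}
```

The second call shows the case with three backslashes: no even-length run of pairs ends at the semicolon without a backslash in front of it, so the string is not split.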

阿楠 2024-12-19 09:38:28

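This approach assumes that your string does not contain the char '\0'. If it does, you can use some other placeholder char. Note that it only recognizes an escaped semicolon that follows a non-backslash character, so it misses one at the very start of the string or after a run of three backslashes.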

public static String[] split(String s) {
    String[] result = s.replaceAll("([^\\\\])\\\\;", "$1\0").split(";");
    for (int i = 0; i < result.length; i++) {
        result[i] = result[i].replaceAll("\0", "\\\\;");
    }
    return result;
}

别念他 2024-12-19 09:38:28

I think this is the real answer.
In my case I am trying to split on |, and the escape character is &.

    final String regx = "(?<!((?:[^&]|^)(&&){0,10000}&))\\|";
    String[] res = "&|aa|aa|&|&&&|&&|s||||e|".split(regx);
    System.out.println(Arrays.toString(res));

In this code I am using a lookbehind to handle the escaping & character.
Note that the lookbehind must have a maximum length.

(?<!((?:[^&]|^)(&&){0,10000}&))\\|

This matches any | except those that follow ((?:[^&]|^)(&&){0,10000}&), a part that matches any odd number of &s.
The part (?:[^&]|^) is important: it makes sure that all of the &s before the | are counted, back to the beginning of the string or to some other character.
