Tokenizing a string by spaces in Java

Posted 2024-08-06 11:40:38

I want to tokenize a string like this

String line = "a=b c='123 456' d=777 e='uij yyy'";

I cannot simply split on spaces like this, because the quoted values contain spaces themselves:

String [] words = line.split(" ");

Any idea how I can split it so that I get tokens like these?

a=b
c='123 456'
d=777
e='uij yyy'

Comments (11)

懵少女 2024-08-13 11:40:38

The simplest way to do this is to implement a simple finite state machine by hand. In other words, process the string one character at a time:

  • When you hit a space, break off a token;
  • When you hit a quote, keep consuming characters until you hit another quote.
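
The two rules above can be sketched roughly as follows. This is a minimal illustration rather than a definitive implementation: the class name QuoteAwareTokenizer and the method tokenize are made up for this example, and it assumes single quotes only, no escape characters, and that the quotes should stay in the token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named only for this example.
public class QuoteAwareTokenizer {

    // Splits on spaces, except for spaces that appear between single quotes.
    public static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char ch = line.charAt(i);
            if (ch == '\'') {
                inQuotes = !inQuotes;      // entering or leaving a quoted value
                current.append(ch);        // keep the quote as part of the token
            } else if (ch == ' ' && !inQuotes) {
                if (current.length() > 0) { // ignore runs of spaces between tokens
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(ch);
            }
        }
        if (current.length() > 0) {         // flush the last token
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String line = "a=b c='123 456' d=777 e='uij yyy'";
        for (String token : tokenize(line)) {
            System.out.println(token);
        }
    }
}
```

An unterminated quote simply swallows the rest of the line into the last token; whether that should be an error instead depends on how strict the input format is.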
揪着可爱 2024-08-13 11:40:38

Depending on the format of your original string, you should be able to pass a regular expression to Java's "split" method: Click here for an example.

The example doesn't use the regular expression that you would need for this task though.

You can also use this SO thread as a guideline (although it's in PHP) which does something very close to what you need. Manipulating that slightly might do the trick (although having quotes be part of the output or not may cause some issues). Keep in mind that regex is very similar in most languages.

Edit: taking this kind of task much further may be beyond the capabilities of regex, so you may need to write a simple parser.

疏忽 2024-08-13 11:40:38
line.split(" (?=[a-z]+=)")

correctly gives:

a=b
c='123 456'
d=777
e='uij yyy'

Make sure you adapt the [a-z]+ part in case the structure of your keys changes.

Edit: this solution can fail miserably if a value contains a space followed by something that looks like a key and a "=" character (for example c='x y=z').
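
As a rough sketch of the same lookahead idea with a more permissive key pattern: the class name LookaheadSplit and the key pattern [^\s=']+ are assumptions made for this example, not part of the answer above. It also shows the failure case mentioned in the edit.

```java
import java.util.Arrays;

// Hypothetical helper class, named only for this example.
public class LookaheadSplit {

    // Split at one or more spaces, but only when the text right after them
    // looks like "key=". The key pattern [^\s=']+ is an assumption:
    // any run of characters that are not whitespace, '=' or a single quote.
    public static String[] split(String line) {
        return line.split(" +(?=[^\\s=']+=)");
    }

    public static void main(String[] args) {
        // The well-behaved case from the question.
        System.out.println(Arrays.toString(split("a=b c='123 456' d=777 e='uij yyy'")));

        // The failure case: " y=" inside the quoted value looks like a new key,
        // so the value is cut in two.
        System.out.println(Arrays.toString(split("c='x y=z'")));
    }
}
```

Because the lookahead consumes nothing, the key itself stays attached to the following token; only the spaces are removed.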

不羁少年 2024-08-13 11:40:38

StreamTokenizer can help, although it is easiest to set up to break on '=', as it will always break at the start of a quoted string:

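import java.io.StreamTokenizer;
import java.io.StringReader;

String s = "Ta=b c='123 456' d=777 e='uij yyy'";
StreamTokenizer st = new StreamTokenizer(new StringReader(s));
// Treat digits as word characters so that 777 is not parsed as a number.
st.ordinaryChars('0', '9');
st.wordChars('0', '9');
while (st.nextToken() != StreamTokenizer.TT_EOF) {
    switch (st.ttype) {
    case StreamTokenizer.TT_NUMBER:
        System.out.println(st.nval);
        break;
    case StreamTokenizer.TT_WORD:
        System.out.println(st.sval);
        break;
    case '=':
        System.out.println("=");
        break;
    default:
        // Quoted strings end up here: ttype is the quote character
        // and sval holds the text between the quotes.
        System.out.println(st.sval);
    }
}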

outputs

Ta
=
b
c
=
123 456
d
=
777
e
=
uij yyy

If you leave out the two lines that convert numeric characters to alpha, then you get d=777.0, which might be useful to you.

ゝ偶尔ゞ 2024-08-13 11:40:38

Assumptions:

  • Your variable name ('a' in the assignment 'a=b') can be of length 1 or more
  • Your variable name ('a' in the assignment 'a=b') cannot contain the space character; anything else is fine.
  • Validation of your input is not required (input assumed to be in valid a=b format)

This works fine for me.

Input:

a=b abc='123 456' &=777 #='uij yyy' ABC='slk slk'              123sdkljhSDFjflsakd@*#&=456sldSLKD)#(

Output:

a=b
abc='123 456'
&=777
#='uij yyy'
ABC='slk slk'             
123sdkljhSDFjflsakd@*#&=456sldSLKD)#(

Code:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTest {

    // SPACE CHARACTER                                          followed by
    // sequence of non-space characters of 1 or more            followed by
    // first occurring EQUALS CHARACTER
    final static String regex = " [^ ]+?=";


    // static pattern defined outside so that you don't have to compile it 
    // for each method call
    static final Pattern p = Pattern.compile(regex);

    public static List<String> tokenize(String input, Pattern p){
        input = input.trim(); // this is important for "last token case"
                                // see end of method
        Matcher m = p.matcher(input);
        ArrayList<String> tokens = new ArrayList<String>();
        int beginIndex=0;
        while(m.find()){
            int endIndex = m.start();
            tokens.add(input.substring(beginIndex, endIndex));
            beginIndex = endIndex+1;
        }

        // LAST TOKEN CASE
        //add last token
        tokens.add(input.substring(beginIndex));

        return tokens;
    }

    private static void println(List<String> tokens) {
        for(String token:tokens){
            System.out.println(token);
        }
    }


    public static void main(String args[]){
        String test = "a=b " +
                "abc='123 456' " +
                "&=777 " +
                "#='uij yyy' " +
                "ABC='slk slk'              " +
                "123sdkljhSDFjflsakd@*#&=456sldSLKD)#(";
        List<String> tokens = RegexTest.tokenize(test, p);
        println(tokens);
    }
}
疧_╮線 2024-08-13 11:40:38

Or, with a regex for tokenizing, and a little state machine that just adds the key/val to a map:

String line = "a = b c='123 456' d=777 e =  'uij yyy'";
Map<String,String> keyval = new HashMap<String,String>();
String state = "key";
Matcher m = Pattern.compile("(=|'[^']*?'|[^\\s=]+)").matcher(line);
String key = null;
while (m.find()) {
    String found = m.group();
    if (state.equals("key")) {
        if (found.equals("=") || found.startsWith("'"))
            { System.err.println ("ERROR"); }
        else { key = found; state = "equals"; }
    } else if (state.equals("equals")) {
        if (! found.equals("=")) { System.err.println ("ERROR"); }
        else { state = "value"; }
    } else if (state.equals("value")) {
        if (key == null) { System.err.println ("ERROR"); }
        else {
            if (found.startsWith("'"))
                found = found.substring(1,found.length()-1);
            keyval.put (key, found);
            key = null;
            state = "key";
        }
    }
}
if (! state.equals("key"))  { System.err.println ("ERROR"); }
System.out.println ("map: " + keyval);

prints out

map: {d=777, e=uij yyy, c=123 456, a=b}

It does some basic error checking, and takes the quotes off the values.

薄荷梦 2024-08-13 11:40:38

This solution is both general and compact (it is effectively the regex version of cletus' answer):

String line = "a=b c='123 456' d=777 e='uij yyy'";
Matcher m = Pattern.compile("('[^']*?'|\\S)+").matcher(line);
while (m.find()) {
  System.out.println(m.group()); // or whatever you want to do
}

In other words, find all runs of characters that are combinations of quoted strings or non-space characters; nested quotes are not supported (there is no escape character).

半暖夏伤 2024-08-13 11:40:38
public static void main(String[] args) {
    String token;
    String value = "";
    HashMap<String, String> attributes = new HashMap<String, String>();
    String line = "a=b c='123  456' d=777 e='uij yyy'";
    StringTokenizer tokenizer = new StringTokenizer(line, " ");
    while (tokenizer.hasMoreTokens()) {
        token = tokenizer.nextToken();
        value = token.contains("'") ? value + " " + token : token;
        if (!value.contains("'") || value.endsWith("'")) {
            // Split the strings and get variables into hashmap
            attributes.put(value.split("=")[0].trim(), value.split("=")[1]);
            value = "";
        }
    }
    System.out.println(attributes);
}

output:
{d=777, a=b, e='uij yyy', c='123 456'}

Note that consecutive spaces inside a value are collapsed to a single space. The attributes HashMap holds the resulting key/value pairs.

幽梦紫曦~ 2024-08-13 11:40:38
import java.io.*;
import java.util.Scanner;

public class ScanXan {
    public static void main(String[] args) throws IOException {

        Scanner s = null;

        try {
            s = new Scanner(new BufferedReader(new FileReader("<file name>")));

            while (s.hasNext()) {
                System.out.println(s.next());
                // write the token to the output file here
            }
        } finally {
            if (s != null) {
                s.close();
            }
        }
    }
}
枉心 2024-08-13 11:40:38
java.util.StringTokenizer tokenizer = new java.util.StringTokenizer(line, " ");
while (tokenizer.hasMoreTokens()) {
    String token = tokenizer.nextToken();
    int index = token.indexOf('=');
    String key = token.substring(0, index);
    String value = token.substring(index + 1);
}
够运 2024-08-13 11:40:38

Have you tried splitting by '=' and creating a token out of each pair of the resulting array?
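
One possible reading of this suggestion, sketched under stated assumptions: keys contain no spaces and values contain no "=" character. The class name EqualsSplit and the method parse are hypothetical. After splitting on "=", every piece except the first and last holds the previous value followed by the next key, so the piece is cut at its last space.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class, named only for this example.
public class EqualsSplit {

    // Assumes keys contain no spaces and values contain no '='.
    public static Map<String, String> parse(String line) {
        String[] parts = line.split("=");
        Map<String, String> result = new LinkedHashMap<>(); // keeps insertion order
        String key = parts[0].trim();
        for (int i = 1; i < parts.length; i++) {
            String part = parts[i];
            if (i == parts.length - 1) {
                // The last piece is just the final value.
                result.put(key, part.trim());
            } else {
                // A middle piece looks like "<value> <next key>".
                int cut = part.lastIndexOf(' ');
                if (cut < 0) {
                    throw new IllegalArgumentException("Malformed input near: " + part);
                }
                result.put(key, part.substring(0, cut).trim());
                key = part.substring(cut + 1);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("a=b c='123 456' d=777 e='uij yyy'"));
    }
}
```

The quotes stay on the values here; stripping them would be a small extra step once the pairs are separated.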
