Regex trouble, escaped quotes
Basically, I'm being passed a string and I need to tokenise it in much the same way a *nix shell tokenises command-line options.
Say I have the following string:
"Hello\" World" "Hello Universe" Hi
How could I turn it into a 3-element list?
- Hello" World
- Hello Universe
- Hi
The following is my first attempt, but it has a number of problems:
- It leaves the quote characters
- It doesn't catch the escaped quote
Code:
public void test() {
    String str = "\"Hello\\\" World\" \"Hello Universe\" Hi";
    List<String> list = split(str);
}

public static List<String> split(String str) {
    Pattern pattern = Pattern.compile(
        "\"[^\"]*\"" +   /* double quoted token */
        "|'[^']*'" +     /* single quoted token */
        "|[A-Za-z']+"    /* everything else */
    );
    List<String> opts = new ArrayList<String>();
    Scanner scanner = new Scanner(str).useDelimiter(pattern);
    String token;
    while ((token = scanner.findInLine(pattern)) != null) {
        opts.add(token);
    }
    return opts;
}
So the incorrect output of the above code is:
- "Hello\"
- World
- " "
- Hello
- Universe
- Hi
EDIT: I'm totally open to a non-regex solution. It's just the first solution that came to mind.
If you decide to forgo regex and do parsing instead, there are a couple of options. If you are willing to accept just a double quote or a single quote (but not both) as your quote character, then you can use StreamTokenizer to solve this easily:
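A minimal sketch of what the StreamTokenizer route could look like, assuming the double quote is the only quote character. The class name StreamTokenizerSketch and the choice to wrap IOException are illustrative, not part of the original answer:

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named here for illustration only.
final class StreamTokenizerSketch {

    static List<String> split(String str) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(str));
        st.resetSyntax();            // drop the defaults (number parsing, '/' comments, both quote chars)
        st.wordChars(0x21, 0xFF);    // every printable character is part of a word...
        st.whitespaceChars(0, ' ');  // ...control characters and space separate tokens...
        st.quoteChar('"');           // ...and '"' delimits a quoted string

        List<String> tokens = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                // For a quoted string, ttype is the quote character and sval is the body
                // with backslash escapes already resolved; for a bare word, ttype is TT_WORD.
                if (st.ttype == StreamTokenizer.TT_WORD || st.ttype == '"') {
                    tokens.add(st.sval);
                }
            }
        } catch (IOException e) {
            // A StringReader should not fail in practice; rethrow unchecked just in case.
            throw new UncheckedIOException(e);
        }
        return tokens;
    }
}
```

Calling StreamTokenizerSketch.split on the string from the question should return the three tokens asked for. Keep in mind that StreamTokenizer also interprets escapes such as \n and \t inside quoted strings, and that a quoted string directly followed by a word (with no space in between) comes back as two separate tokens.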
If you must support both quotes, here is a naive implementation that should work (with the caveat that a string like '"blah \" blah"blah' will yield something like 'blah " blahblah'; if that isn't OK, you will need to make some changes):
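One way such a naive implementation might look is a single pass over the characters that tracks which quote, if any, is currently open. This is only a sketch under that assumption; the class name QuoteAwareSplitter is made up for the example. Like the caveat above says, a quoted section glued to unquoted text ends up in the same token:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named here for illustration only.
final class QuoteAwareSplitter {

    static List<String> split(String str) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inToken = false; // true once a token has started, so "" still yields an empty token
        char quote = 0;          // the quote character currently open, or 0 outside quotes

        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            if (c == '\\' && i + 1 < str.length()) {
                // A backslash makes the next character literal, inside or outside quotes.
                current.append(str.charAt(++i));
                inToken = true;
            } else if (quote != 0) {
                if (c == quote) {
                    quote = 0;             // closing quote: dropped from the output
                } else {
                    current.append(c);     // anything else inside quotes is kept, spaces included
                }
            } else if (c == '"' || c == '\'') {
                quote = c;                 // opening quote: dropped from the output
                inToken = true;
            } else if (Character.isWhitespace(c)) {
                if (inToken) {             // unquoted whitespace ends the current token
                    tokens.add(current.toString());
                    current.setLength(0);
                    inToken = false;
                }
            } else {
                current.append(c);
                inToken = true;
            }
        }
        if (inToken) {
            tokens.add(current.toString()); // flush the last token; an unclosed quote is tolerated
        }
        return tokens;
    }
}
```

The sketch does not report unbalanced quotes; it simply treats the end of the input as the end of the token.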
I'm pretty sure you can't do this by just tokenising on a regex. If you need to deal with nested and escaped delimiters, you need to write a parser. See e.g. http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html
There are probably open source parsers that can do what you want, although I don't know of any. You should also check out the StreamTokenizer class.
To recap, you want to split on whitespace, except when surrounded by double quotes, which are not preceded by a backslash.
Step 1: tokenize the input:
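/([ \t]+)|(\\")|(")|([^ \t"\\]+|\\)/
This gives you a sequence of SPACE, ESCAPED_QUOTE, QUOTE and TEXT tokens. TEXT must not swallow a backslash that precedes a quote; otherwise Hello\" would be read as the TEXT Hello\ followed by a plain QUOTE. For that reason a lone backslash is matched as its own one-character TEXT token.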
Step 2: build a finite state machine matching and reacting to the tokens:
State: START
State: WITHIN_QUOTES
Step 3: Profit!!
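The two steps could be sketched in Java roughly as follows. The transitions are my reading of the two states named above, not something spelled out in the answer: in START, SPACE ends the current token and QUOTE switches to WITHIN_QUOTES; in WITHIN_QUOTES, SPACE is kept as text and QUOTE switches back to START. The TEXT alternative in this sketch excludes the backslash so that \" can actually be seen as ESCAPED_QUOTE. The class name StateMachineSplitter is hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named here for illustration only.
final class StateMachineSplitter {

    // Group 1 = SPACE, 2 = ESCAPED_QUOTE, 3 = QUOTE, 4 = TEXT (a lone backslash counts as TEXT).
    private static final Pattern LEXER = Pattern.compile(
            "([ \\t]+)|(\\\\\")|(\")|([^ \\t\"\\\\]+|\\\\)");

    private enum State { START, WITHIN_QUOTES }

    static List<String> split(String input) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean hasToken = false;          // distinguishes "no token yet" from an empty "" token
        State state = State.START;

        Matcher m = LEXER.matcher(input);
        while (m.find()) {
            if (m.group(1) != null) {                      // SPACE
                if (state == State.WITHIN_QUOTES) {
                    current.append(m.group(1));            // whitespace inside quotes is content
                } else if (hasToken) {
                    result.add(current.toString());        // whitespace outside quotes ends a token
                    current.setLength(0);
                    hasToken = false;
                }
            } else if (m.group(2) != null) {               // ESCAPED_QUOTE
                current.append('"');                       // keep the quote, drop the backslash
                hasToken = true;
            } else if (m.group(3) != null) {               // QUOTE
                state = (state == State.START) ? State.WITHIN_QUOTES : State.START;
                hasToken = true;
            } else {                                       // TEXT
                current.append(m.group(4));
                hasToken = true;
            }
        }
        if (hasToken) {
            result.add(current.toString());                // flush the final token
        }
        return result;
    }
}
```

Because the lexer only knows about double quotes, single quotes are treated as ordinary text here, and an unclosed quote simply runs to the end of the input.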
I think if you use a pattern like this:
Then it will give you the desired output. When I ran it with your input data I got this list:
I used [A-Za-z']+ from your own question, but shouldn't it be just [A-Za-z]+?
EDIT: Change your opts.add(token); line to:
The first thing you need to do is stop thinking of the job in terms of split(). split() is meant for breaking down simple strings like this/that/the other, where / is always a delimiter. But you're trying to split on whitespace, unless the whitespace is within quotes, except if the quotes are escaped with backslashes (and if backslashes escape quotes, they probably escape other things, like other backslashes).
With all those exceptions-to-exceptions, it's just not possible to create a regex to match all possible delimiters, not even with fancy gimmicks like lookarounds, conditionals, reluctant and possessive quantifiers. What you want to do is match the tokens, not the delimiters.
In the following code, a token that's enclosed in double-quotes or single-quotes may contain whitespace as well as the quote character if it's preceded by a backslash. Everything except the enclosing quotes is captured in group #1 (for double-quoted tokens) or group #2 (single-quoted). Any character may be escaped with a backslash, even in non-quoted tokens; the "escaping" backslashes are removed in a separate step.
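A sketch that follows this description might look like the code below. The exact pattern and the class name RegexTokenMatcher are assumptions made for illustration; only the group layout (group 1 for double-quoted bodies, group 2 for single-quoted bodies) and the separate unescaping step come from the description:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named here for illustration only.
final class RegexTokenMatcher {

    private static final Pattern TOKEN = Pattern.compile(
            "\"((?:\\\\.|[^\"\\\\])*)\""     // double-quoted token, body captured in group 1
            + "|'((?:\\\\.|[^'\\\\])*)'"     // single-quoted token, body captured in group 2
            + "|(?:\\\\.|\\S)+");            // unquoted token: escaped chars or non-whitespace

    static List<String> split(String str) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(str);
        while (m.find()) {
            String token;
            if (m.group(1) != null) {
                token = m.group(1);          // strip the enclosing double quotes
            } else if (m.group(2) != null) {
                token = m.group(2);          // strip the enclosing single quotes
            } else {
                token = m.group();           // unquoted token, taken whole
            }
            // Separate step: remove each escaping backslash, keeping the escaped character.
            tokens.add(token.replaceAll("\\\\(.)", "$1"));
        }
        return tokens;
    }
}
```

With an unbalanced quote the quoted alternatives fail to match and the text falls through to the unquoted alternative, so the stray quote character stays in the token.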
output:
This is about as simple as it gets for a regex-based solution, and it doesn't really deal with malformed input such as unbalanced quotes. If you're not fluent in regexes, you might be better off with a purely hand-coded solution or, even better, a dedicated command-line interpreter (CLI) library.