Tokenizing quoted strings

Posted on 2024-11-27 20:35:23

I am trying to tokenize strings. As long as there are no quoting characters all is well:

string:tokens ("abc def ghi", " ").
["abc","def","ghi"]

But string:tokens/2 does not help me much with quoted strings. It behaves as expected:

string:tokens ("abc \"def xyz\" ghi", " ").
["abc","\"def","xyz\"","ghi"]

What I need is a function that takes a string to be tokenized, a delimiter and a quote character. Something like:

tokens ("abc \"def xyz\" ghi", " ", "\"").
["abc","def xyz","ghi"]

Now before I start reinventing the wheel, my question is:

Is there such a function or a similar one in the standard libs?

EDIT:

OK, I wrote my own implementation, but I am still highly interested in answers to the original question. Here goes my code so far:

tokens(String) -> tokens(String, [], []).

tokens([], Tokens, Buffer) ->
    lists:map(fun(Token) -> string:strip(Token, both, $") end, Tokens ++ [Buffer]);

tokens([Character | String], Tokens, Buffer) ->
    case {Character, Buffer} of
        {$ , []} -> tokens(String, Tokens, Buffer);
        {$ , [$" | _]} -> tokens(String, Tokens, Buffer ++ [Character]);
        {$ , _} -> tokens(String, Tokens ++ [Buffer], []);
        {$", []} -> tokens(String, Tokens, "\"");
        {$", [$" | _]} -> tokens(String, Tokens ++ [Buffer ++ "\""], []);
        {$", _} -> tokens(String, Tokens ++ [Buffer], "\"");
        _ -> tokens(String, Tokens, Buffer ++ [Character])
    end.
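For reference, here is what a shell session with the code above looks like, assuming it is compiled into a module; the module name `tok` is my own choice:

```erlang
%% Erlang shell session; assumes the tokens/1 code above lives in a
%% module named `tok` (the module name is my own choice).
1> tok:tokens("abc \"def xyz\" ghi").
["abc","def xyz","ghi"]
```

The final `string:strip(Token, both, $")` pass is what removes the quote characters that the accumulator keeps around quoted tokens.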


Comments (4)

昔日梦未散 2024-12-04 20:35:23

If regular expressions are acceptable in the general case, you can use:

> re:split("abc \"def xyz\" ghi", " \"|\" ", [{return, list}]).
["abc","def xyz","ghi"]

You can also use "\s\"|\"\s" if you want to split based on any whitespace instead of just spaces.

If you happen to be parsing this from an input file, you may want to use strip_split/2 from estring.

∞梦里开花 2024-12-04 20:35:23

string:tokens("abc \"def ghi\" foo.bla", " .\""). will tokenize the string on space, dot and double quote. Result: ["abc", "def", "ghi", "foo", "bla"]. If you want to preserve the quoted parts, you might want to consider writing a tokenizer/lexer, because regular expressions are not well suited to this job.

她如夕阳 2024-12-04 20:35:23

You could use the re module. It comes with a split/3 function. For example:

re:split("abc \"def xyz \"ghi", "[(\s\")\s\"]", [{return, list}]).
["abc",[],"def","xyz",[],"ghi"]

The second argument is a regular expression (you might have to tweak my example to remove the empty lists...)
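One way to remove those empty lists afterwards, assuming the same re:split call as above, is a list comprehension filter (a sketch; note that the quoted phrase is still split on its inner space with this regex):

```erlang
%% Split on space, double quote and parentheses as in the answer above,
%% then drop the empty strings left by adjacent separators.
Parts = re:split("abc \"def xyz \"ghi", "[(\s\")\s\"]", [{return, list}]),
[P || P <- Parts, P =/= []].
%% -> ["abc","def","xyz","ghi"]
```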

青丝拂面 2024-12-04 20:35:23

This is approximately how I would write it (not tested!):

tokens(String) -> lists:reverse(tokens(String, outside_quotes, [])).

tokens([], outside_quotes, Tokens) ->
  Tokens;
tokens(String, outside_quotes, Tokens) ->
  %% split off everything up to the next space or double quote
  {Token, Rest0} = lists:splitwith(fun(C) -> (C =/= $ ) and (C =/= $") end, String),
  case Rest0 of
    [] -> [Token | Tokens];
    [$  | Rest] -> tokens(Rest, outside_quotes, [Token | Tokens]);
    [$" | Rest] -> tokens(Rest, inside_quotes, [Token | Tokens])
  end;
tokens(String, inside_quotes, Tokens) ->
  %% exception on an unclosed quote
  {Token, [$" | Rest]} = lists:splitwith(fun(C) -> C =/= $" end, String),
  tokens(Rest, outside_quotes, [Token | Tokens]).
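One caveat: the splitwith-based version above produces empty tokens wherever a separator sits directly next to a quote character (splitwith returns an empty Token before the opening quote and after the closing one). A small wrapper filters them out; this is my own addition, not part of the original answer:

```erlang
%% Drop the empty tokens emitted around quote boundaries.
%% The name tokens_clean is my own choice.
tokens_clean(String) ->
    [T || T <- tokens(String), T =/= []].

%% tokens_clean("abc \"def xyz\" ghi") -> ["abc","def xyz","ghi"]
```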