Tokenizing quoted strings

Posted on 2024-11-27 20:35:23

I am trying to tokenize strings. As long as there are no quoting characters all is well:

string:tokens ("abc def ghi", " ").
["abc","def","ghi"]

But string:tokens/2 does not help me much with quoted strings. It behaves as expected:

string:tokens ("abc \"def xyz\" ghi", " ").
["abc","\"def","xyz\"","ghi"]

What I need is a function that takes a string to be tokenized, a delimiter and a quote character. Something like:

tokens ("abc \"def xyz\" ghi", " ", "\"").
["abc","def xyz","ghi"]

Now before I start reinventing the wheel, my question is:

Is there such a function or a similar one in the standard libs?

EDIT:

OK, I wrote my own implementation, but I am still highly interested in answers to the original question. Here goes my code so far:

tokens(String) -> tokens(String, [], []).

tokens([], Tokens, Buffer) ->
    lists:map(fun(Token) -> string:strip(Token, both, $") end, Tokens ++ [Buffer]);

tokens([Character | String], Tokens, Buffer) ->
    case {Character, Buffer} of
        {$ , []} -> tokens(String, Tokens, Buffer);
        {$ , [$" | _]} -> tokens(String, Tokens, Buffer ++ [Character]);
        {$ , _} -> tokens(String, Tokens ++ [Buffer], []);
        {$", []} -> tokens(String, Tokens, "\"");
        {$", [$" | _]} -> tokens(String, Tokens ++ [Buffer ++ "\""], []);
        {$", _} -> tokens(String, Tokens ++ [Buffer], "\"");
        _ -> tokens(String, Tokens, Buffer ++ [Character])
    end.
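For reference, here is what a shell session with the code above looks like, assuming it is compiled into a module; the module name `tok` is my own choice:

```erlang
%% Erlang shell session; assumes the tokens/1 code above lives in a
%% module named `tok` (the module name is my own choice).
1> tok:tokens("abc \"def xyz\" ghi").
["abc","def xyz","ghi"]
```

The final `string:strip(Token, both, $")` pass is what removes the quote characters that the accumulator keeps around quoted tokens.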


Comments (4)

昔日梦未散 2024-12-04 20:35:23

If regular expressions are acceptable in the general case, you can use:

> re:split("abc \"def xyz\" ghi", " \"|\" ", [{return, list}]).
["abc","def xyz","ghi"]

You can also use "\s\"|\"\s" if you want to split based on any whitespace instead of just spaces.

If you happen to be parsing this from an input file, you may want to use strip_split/2 from estring.

∞梦里开花 2024-12-04 20:35:23

string:tokens("abc \"def ghi\" foo.bla", " .\""). will tokenize the string on space, dot and double quote. Result: ["abc", "def", "ghi", "foo", "bla"]. If you want to preserve the quoted parts, you might want to consider writing a tokenizer/lexer, because regular expressions are not well suited to this job.

她如夕阳 2024-12-04 20:35:23

You could use the re module. It comes with a split/3 function. For example:

re:split("abc \"def xyz \"ghi", "[(\s\")\s\"]", [{return, list}]).
["abc",[],"def","xyz",[],"ghi"]

The second argument is a regular expression (you might have to tweak my example to remove the empty lists...)
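One way to remove those empty lists afterwards, assuming the same re:split call as above, is a list comprehension filter (a sketch; note that the quoted phrase is still split on its inner space with this regex):

```erlang
%% Split on space, double quote and parentheses as in the answer above,
%% then drop the empty strings left by adjacent separators.
Parts = re:split("abc \"def xyz \"ghi", "[(\s\")\s\"]", [{return, list}]),
[P || P <- Parts, P =/= []].
%% -> ["abc","def","xyz","ghi"]
```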

青丝拂面 2024-12-04 20:35:23

This is approximately how I would write it (not tested!):

tokens(String) -> lists:reverse(tokens(String, outside_quotes, [])).

tokens([], outside_quotes, Tokens) ->
  Tokens;
tokens(String, outside_quotes, Tokens) ->
  %% split off everything up to the next space or double quote
  {Token, Rest0} = lists:splitwith(fun(C) -> (C =/= $ ) and (C =/= $") end, String),
  case Rest0 of
    [] -> [Token | Tokens];
    [$  | Rest] -> tokens(Rest, outside_quotes, [Token | Tokens]);
    [$" | Rest] -> tokens(Rest, inside_quotes, [Token | Tokens])
  end;
tokens(String, inside_quotes, Tokens) ->
  %% exception on an unclosed quote
  {Token, [$" | Rest]} = lists:splitwith(fun(C) -> C =/= $" end, String),
  tokens(Rest, outside_quotes, [Token | Tokens]).
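One caveat: the splitwith-based version above produces empty tokens wherever a separator sits directly next to a quote character (splitwith returns an empty Token before the opening quote and after the closing one). A small wrapper filters them out; this is my own addition, not part of the original answer:

```erlang
%% Drop the empty tokens emitted around quote boundaries.
%% The name tokens_clean is my own choice.
tokens_clean(String) ->
    [T || T <- tokens(String), T =/= []].

%% tokens_clean("abc \"def xyz\" ghi") -> ["abc","def xyz","ghi"]
```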